173 *33 



DOCUMENT RESUME

TM 009 570

TITLE          Proceedings of the Invitational Conference on Testing
               Problems (New York, New York, November 1, 1952).
INSTITUTION    Educational Testing Service, Princeton, N.J.
PUB DATE       1 Nov 52
NOTE           135p.
EDRS PRICE     MF01/PC06 Plus Postage.
DESCRIPTORS    Culture Free Tests; *Elections; *Intelligence Tests;
               *Motivation; *Norm Referenced Tests; Prediction;
               Public Opinion; Student Testing; *Surveys; *Test
               Bias; Test Construction; *Testing Problems; *Test
               Interpretation
IDENTIFIERS    Semantic Test of Intelligence



ABSTRACT

Four topics were emphasized during this conference on
testing problems: (1) the selection of appropriate score scales for
tests; (2) the experimental approach to the measurement of human
motivation; (3) trends in public opinion polling since 1948 and their
probable effects on predictions of the 1952 election; and (4)
techniques for developing unbiased intelligence tests. Eric F.
Gardner and Ledyard R. Tucker both discussed the importance of
reference groups in scaling procedure; discussions by John C.
Flanagan and E. F. Lindquist followed. David C. McClelland presented
the address on the measurement of human motivation; a panel
discussion followed. The luncheon symposium focused on the trends in
public opinion polling since 1948, and on prediction of the 1952
election. Sampling, interviewing, and data analysis were discussed
by Frederick F. Stephan, Herbert Hyman, and Samuel Stouffer,
respectively. The session on unbiased intelligence tests included
papers by Irving Lorge, Phillip Rulon, and Ernest A. Haggard;
Quinn McNemar and Ernest A. Haggard commented on the papers. (An
example of Phillip Rulon's Semantic Test of Intelligence--STI--is
included.) (GDC)



Reproductions supplied by EDRS are the best that can be made
from the original document.



EDUCATIONAL TESTING SERVICE

BOARD OF TRUSTEES

Katharine E. McBride, Chairman

Arthur S. Adams            Henry Hill
Raymond B. Allen           Herold C. Hunt
Joseph W. Barker           Lewis W. Jones
Frank H. Bowles            Thomas McConnell
Oliver C. Carmichael       Lester W. Nelson
Charles W. Cole            Edward S. Noyes
James B. Conant            George Stoddard

OFFICERS

Henry Chauncey, President
Richard H. Sullivan, Vice President and Treasurer
William W. Turnbull, Vice President
Jack K. Rimalover, Secretary
Catherine G. Sharp, Assistant Secretary
Robert Kolkebeck [title illegible in source]



COPYRIGHT, 1953, EDUCATIONAL TESTING SERVICE
20 NASSAU STREET, PRINCETON, N. J.
PRINTED IN THE UNITED STATES OF AMERICA






INVITATIONAL
CONFERENCE
ON
TESTING PROBLEMS

NOVEMBER 1, 1952

GEORGE K. BENNETT, Chairman

¶ Selecting Appropriate Score Scales for Tests

¶ The Measurement of Human Motivation:
  An Experimental Approach

¶ Trends in Public Opinion Polling Since 1948 and Their Probable
  Effect on 1952 Election Predictions

¶ Techniques for the Development of Unbiased Tests

EDUCATIONAL TESTING SERVICE

PRINCETON, NEW JERSEY

LOS ANGELES, CALIFORNIA






FOREWORD



The 1952 Invitational Conference on Testing Problems
marked the sixteenth year of the Conference and the fifth
year it has been sponsored by Educational Testing Service.
With the transfer of the testing activities of the American
Council on Education to newly formed Educational Testing
Service in early 1948, it had seemed appropriate to transfer
also the sponsorship of the annual Conference.

Under the Council's capable guidance the Invitational
Conference had grown from a small group to almost two hundred
interested participants.

This year the number of persons attending reached a new
high, almost 100 more than the previous year and more than
double that of five years ago. This would seem to be
attributable not only to the growth of interest in measurement
problems generally but also to the particular appeal of the
program arranged by Chairman George K. Bennett.

In planning his program, Dr. Bennett reached coast to
coast for the most able men to discuss the topics scheduled.
For the luncheon symposium he arranged a program closely
related to the national election, which followed the
Invitational Conference by three days. His efforts resulted in a
meeting of great professional value and intellectual
stimulation, one well befitting the tradition of successful
Invitational Conferences.

To George Bennett and those participants who made this
year's Conference so significantly successful I want to
express my deep appreciation for a job well done.

HENRY CHAUNCEY
President



The papers and discussions of the 1952 Invitational Conference on
Testing Problems sponsored by Educational Testing Service are
permanently recorded on the pages that follow. The Conference, held
November 1, 1952, at the Roosevelt Hotel, New York City, attracted
more than 400 individuals. There were four sections in the program: a
morning panel discussing "Selecting Appropriate Score Scales for
Tests," an address by Dr. David C. McClelland on "The Measurement
of Human Motivation: An Experimental Approach," a luncheon
symposium on "Trends in Public Opinion Polling Since 1948 and Their
Probable Effect on 1952 Election Predictions," and an afternoon panel
on "Techniques for Development of Unbiased Tests."

As in past years the topics selected have been those regarded as
timely in interest and important in psychometric implications. The
quality of the audience at these meetings is such as to stimulate
speakers to make carefully prepared and logical presentations of
their points of view.

It is felt that the papers presented on this occasion have maintained
the high level established by previous participants. It does not seem
appropriate for the Chairman of this session to comment further upon
the topics considered, but it is seemly for him to express his gratitude
to Educational Testing Service for the privilege of presiding as well as
his confidence that the Invitational Conference will exert a beneficial
influence upon measurement in education and psychology for many
years to come.

George K. Bennett, Chairman
1952 Conference

Selecting Appropriate Score Scales for Tests

ERIC F. GARDNER

THE IMPORTANCE OF REFERENCE GROUPS IN
SCALING PROCEDURE

It is commonly accepted that a single isolated test score is of little or
no value. For a score to have meaning and be of social or scientific
utility, some sort of frame of reference is needed. A number of
different frames of reference have been proposed and been found to
have value. In view of the fact that this session is devoted to a
consideration of the scaling of tests both with and without emphasis on a
reference population, it is the purpose of this paper to present some
of the more common scaling methods and to comment on the role
played by the underlying population.

Role of Population in Scaling Test Scores

A familiar frame of reference is provided by the performance of
individuals in a single well-defined group on a particular test at a
particular time. Two commonly used types of scales have been
derived within such a frame of reference. The simplest are ordinal scales
such as percentile scores in which the scale number describes relative
position in a group. The simplicity of percentile scores is also their
limitation; they do not have algebraic utility. The second type are
interval scales where an effort has been made to obtain algebraic
utility by definition. The T-scores of McCall represent an interval
scale where equal units have been defined as equal distances along
the abscissa of a postulated normal population frequency distribution.
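
As an illustration of the two scale types just described, the following Python sketch computes a percentile rank within a reference group and a normalized T-score of the McCall type (mean 50, standard deviation 10 on a postulated normal distribution). The function names, the half-credit treatment of ties, the clipping of extreme proportions, and the two made-up reference groups are assumptions introduced for illustration; they are not taken from the paper.

```python
from statistics import NormalDist


def percentile_rank(score, reference_scores):
    """Percent of the reference group below `score` (ties count one half)."""
    below = sum(s < score for s in reference_scores)
    ties = sum(s == score for s in reference_scores)
    return 100.0 * (below + 0.5 * ties) / len(reference_scores)


def normalized_t_score(score, reference_scores):
    """Normalized T-score: 50 + 10 * (normal deviate of the percentile rank)."""
    p = percentile_rank(score, reference_scores) / 100.0
    # Assumption: clip so the inverse normal stays finite at the extremes.
    p = min(max(p, 0.001), 0.999)
    return 50.0 + 10.0 * NormalDist().inv_cdf(p)


# Hypothetical reference groups (made-up raw scores, not data from the paper).
typical_group = list(range(1, 101))    # raw scores 1..100
weaker_group = list(range(-19, 81))    # same spread, shifted 20 points lower

print(percentile_rank(84, typical_group))
print(round(normalized_t_score(84, typical_group), 1))
print(round(normalized_t_score(84, weaker_group), 1))
```

In this sketch the same raw score of 84 falls near the eighty-fourth percentile (a T-score close to 60) in the first group and much higher in the second, so the scaled value depends on the reference group as much as on the performance itself.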

A second type of frame of reference is provided by the test
performance of individuals belonging to well-defined subgroups where
the subgroups have a specific relationship to each other within the
composite group. Within this frame of reference both ordinal and
interval scales have been derived. Initially the basic problem is to
obtain ordinally related subgroups such as grades 1 to 9 or age groups
from a specified population for the scaling operation. Age scores and
grade scores provide ordinal scales which have had wide utility in
the elementary grades. Attempts have been made to obtain the merits
of an algebraically manipulatable scale by utilizing ordinal
relationship of subgroups but introducing restrictions in terms of the shape
of the frequency distributions. Efforts to obtain interval scales within
such frames of reference have been made by Flanagan in the
development of Scaled Scores (1) for the Cooperative tests and by the
speaker in the development of K-Scores (2). Cooperative Scaled Scores
are based on the assumption of overlapping normal distribution of
ability groups and K-scores on the assumption that overlapping grade
distributions can be represented by Pearson Type III Curves.

The importance of the particular reference population which is used
to determine any such scales cannot be overemphasized. A person
scoring at the eighty-fourth percentile or obtaining a T-score of 60
in an arithmetic test where the score is calculated for a typical seventh
grade is obviously not performing equally to one whose standing at
the eighty-fourth percentile on the same test is calculated for a
below-average seventh grade. Likewise a pupil with a vocabulary grade score
of 5.2 obtained from a representative sample of fifth graders in, say,
Mississippi, is certainly not comparable to a pupil making a score of
5.2 based on a national representative sample. The importance of the
particular population in determining the fundamental reference point
and size of unit is stressed by the originators of both Cooperative
Scaled Scores and K-Scores. The ratio between the variabilities of
overlapping groups in both Scaled Scores and K-scores is a function
of the areas cut off in samples of the overlapping groups by the same
points in each of the overlapping distributions. Hence this important
characteristic of the basic units in each type of scale depends upon
the particular sample selected since it is highly probable that
overlapping distributions selected from different populations will have
different amounts of overlap at points along the scale.
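
The dependence of the variability ratio on the areas cut off can be sketched numerically. The Python example below assumes two overlapping groups that are both normally distributed (the assumption attributed above to Cooperative Scaled Scores; the Pearson Type III fitting used for K-scores is not reproduced). If two common raw-score points cut off known proportions in each group, the ratio of the standard deviations follows from the corresponding normal deviates. The function name and the two example distributions are hypothetical.

```python
from statistics import NormalDist


def sd_ratio_from_overlap(props_a, props_b):
    """Return sigma_B / sigma_A from the proportions of groups A and B
    falling below the same two raw-score points (normality assumed)."""
    std = NormalDist()
    za1, za2 = (std.inv_cdf(p) for p in props_a)
    zb1, zb2 = (std.inv_cdf(p) for p in props_b)
    # The raw-score distance between the two points is the same in both
    # groups: sigma_A * (za2 - za1) == sigma_B * (zb2 - zb1).
    return (za2 - za1) / (zb2 - zb1)


# Hypothetical overlapping groups on a common raw-score scale.
group_a = NormalDist(mu=0.0, sigma=1.0)
group_b = NormalDist(mu=1.0, sigma=2.0)
cut_points = (0.0, 2.0)

props_a = [group_a.cdf(x) for x in cut_points]
props_b = [group_b.cdf(x) for x in cut_points]
ratio = sd_ratio_from_overlap(props_a, props_b)
print(round(ratio, 3))
```

Because the ratio is computed entirely from the observed proportions, a sample with a different amount of overlap would yield a different ratio, and hence a different relation between the units of the two groups.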

Psychophysical scaling procedures are also sometimes applied to
achievement testing. It is to be noted that resulting scales such as
sensed difference units which are based on just-noticeable-differences
or equally-often-noted-differences are a function not only of the pupils
tested but also of the sample of persons making the required judgments.

Properties of Scale Dependent upon Po[illegible]

Test scores are used by administrators, teachers and research
workers to make comparisons in terms of rank, level of development, growth
and trait differences among both individuals and groups. Hence many
types of scales have been developed depending upon the intended
use. Each is consistent within itself but the properties of the scales
are not completely consistent from one type of scale to another. For
example a grade scale is not appropriate for measuring growth in a
function unless one is willing to accept the assumption that growth is
linearly related to grade. K-scores which are designed to provide an
interval scale for measuring growth during the elementary school
within a particular school subject are not comparable from one school
subject to another unless one is willing to assume a common growth
for all the subjects being compared. Furthermore the adoption of a
uniform standard deviation of 7 K-units for fifth grade distributions
defines as equal the variability of fifth grade performance in all
functions. The scaling of the Binet items involves the assumption of a linear
relationship between Mental Age and Chronological Age. As valuable
and useful as the Binet Scale has been for the purpose for which it
was designed, it has obvious limitations when we try to infer the
true nature of intellectual growth.

Scaling Stability in Large Representative Populations

Scales derive their properties in two ways: by definition and by
experimental verification. Using K-scores as an example let us consider
two desirable properties of a scale: (1) that it shall be invariant with
respect to the sample of items used and (2) that it be invariant with
respect to the population used in its derivation. The first property is
inherent in the definition of K-scores and in the specific definitions of
other scores such as Cooperative Scaled Scores. That is, since K-scores
are defined by the amount of overlap between adjacent grade
distributions any test of a function that will reliably rank the scaling
sample in the same way will give rise to exactly the same set of
K-scores.

For example, the K-scores obtained from Stanford Achievement
Word Meaning test data would be identical to the K-scores obtained
from the Metropolitan Vocabulary test data provided all children in
the grade range scaled were ranked in the same order by both tests.

The second property mentioned is not necessarily inherent in
K-scores in terms of their derivation. With suitable attention to
sampling problems it is reasonable to expect to obtain scales with
reproducible properties from one sample to another. In contrast such
reproducibility is not expected from population to population. There
are, however, practical situations in which it would be useful to have
a scale which was invariant with respect to more than a single
population. For example, since achievement tests are used for
measuring growth and comparing the performance of groups over a long
period of time (8 to 10 years) it would be desirable to have a scale
which would be invariant with respect to national samples taken
annually. Any such property of K-scores or any other scale must be
established on an empirical or experimental basis. Such stability from
one population to another is evidenced in recent efforts to apply
K-score scaling to the forthcoming edition of the Stanford
Achievement tests.
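
The first property discussed above, invariance with respect to the sample of items, holds for any scale built only from the ranking of the scaling sample. The Python sketch below makes this concrete with a simple normalized rank scale used as a stand-in; it is not the K-score procedure, and the function name, the mean-50 and SD-10 convention, the no-ties assumption, and the two sets of raw scores are all hypothetical.

```python
from statistics import NormalDist


def rank_based_scale(raw_scores):
    """Scale scores using only each person's rank in the scaling sample.
    Assumes there are no tied raw scores."""
    n = len(raw_scores)
    order = sorted(range(n), key=lambda i: raw_scores[i])
    scaled = [0.0] * n
    for rank, person in enumerate(order):
        proportion = (rank + 0.5) / n          # mid-rank proportion
        scaled[person] = 50.0 + 10.0 * NormalDist().inv_cdf(proportion)
    return scaled


# Two hypothetical tests of the same function taken by the same five pupils.
# The raw scores differ, but both tests rank the pupils identically.
test_one = [3, 10, 7, 15, 1]
test_two = [x * x + 5 for x in test_one]

print(rank_based_scale(test_one) == rank_based_scale(test_two))
```

The comparison prints True: once two tests order the scaling sample in the same way, the scaled values coincide regardless of how the raw scores were spaced.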

Grade means, differences in grade means, grade standard deviations
and grade skewnesses expressed in K-units determined from the
performance of the national normative sample obtained in 1952 on
Form J of the forthcoming revision of the Stanford Achievement Test
are compared in Table I with the corresponding statistics expressed
in K-units determined from the 1940 national normative sample on
Form D of the Stanford Achievement Test.

A K-unit is defined as one-seventh the standard deviation of the
national grade 5 frequency distribution in any trait where Pearson
Type III Curves have been fitted to it and to the adjacent grades in
such a way that the proportion of cases in each grade exceeding each
raw score is the same as that found in the original data. The mean
performance of children in the United States after completing the
ninth grade was selected as the reference point and assigned a K-score
of 100.
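
The unit and origin conventions in this definition can be written out as a short Python sketch. It is a deliberately linear simplification: it assumes the value to be converted already lies on a common interval scale, and it omits the fitting of Pearson Type III curves to adjacent grades that the actual derivation requires. The function name and the numerical values are hypothetical.

```python
def k_score(value, grade5_sd, grade9_mean):
    """Express `value` in K-units relative to the grade 9 reference point.

    One K-unit is one-seventh of the grade 5 standard deviation, and the
    mean at the end of grade 9 is assigned a K-score of 100.
    """
    k_unit = grade5_sd / 7.0
    return 100.0 + (value - grade9_mean) / k_unit


# Hypothetical values on an assumed common interval scale.
GRADE5_SD = 14.0
GRADE9_MEAN = 60.0

print(k_score(60.0, GRADE5_SD, GRADE9_MEAN))   # the reference point itself
print(k_score(46.0, GRADE5_SD, GRADE9_MEAN))   # one grade 5 SD below it
```

Under these conventions a distance of one grade 5 standard deviation always corresponds to 7 K-units, which is the sense in which the definition fixes the size of the unit.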

The 1940 sample in terms of which the 1948 K-units were defined
consisted of approximately 50,000 cases and was itself a twenty
percent random sample selected from about 300,000 pupils to whom the
Stanford Achievement Test Form D was administered at the end of
the school year in 1940. The sample appeared representative of the
national elementary school population with respect to sex, I.Q., age
and geographical location.

The sample in terms of which the present (1952) K-units for
arithmetic reasoning were defined consists of approximately 94,000 cases
and was selected from a sample of about 460,000 pupils to whom the
new Stanford Achievement Test Form J was administered in April
and May, 1952. Communities were selected to give a representative
national sample in terms of size and geographical location according
to the United States census. All pupils in at least three consecutive
grades in those communities were tested and a twenty percent sample
of these testees was taken at random from each class tested within
those communities.
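
The within-class sampling step described above can be sketched in Python as follows. This is an illustration only: the paper does not say how the random draw was carried out, and the function name, the rounding rule, the fixed seed, and the invented class rosters are all assumptions.

```python
import random


def sample_within_classes(classes, fraction=0.20, seed=0):
    """Draw a simple random sample of `fraction` of the pupils in each class."""
    rng = random.Random(seed)          # fixed seed so the draw is repeatable
    sample = {}
    for class_id, pupils in classes.items():
        k = round(fraction * len(pupils))
        sample[class_id] = rng.sample(pupils, k)   # sampling without replacement
    return sample


# Hypothetical classes; the pupil identifiers are invented for illustration.
classes = {
    "grade4_room1": [f"pupil{i}" for i in range(30)],
    "grade5_room1": [f"pupil{i}" for i in range(100, 125)],
}
sample = sample_within_classes(classes)
print({class_id: len(chosen) for class_id, chosen in sample.items()})
```

Sampling separately within every class keeps each tested class represented in proportion to its size, which a single draw from the pooled pupils would achieve only on average.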

In order to compare the results obtained when K-units in arithmetic
reasoning were derived independently from each population let us
now examine: (1) the average growth in arithmetic reasoning from
grade to grade; (2) the extent to which the variability in arithmetic
reasoning changes as children progress through the grades; (3) the
effect of progress on the skewness of the grade distributions and (4)
whether the fitted curves approximate normal curves.

Although it is commonly believed that growth in specific subjects
in the elementary and junior high schools is not constant from grade
to grade, the objective verification of this belief has been difficult due
to lack of an interval scale extending over the range of grades. The
differences in mean achievement (in terms of K-scores) of successive
grades of the 1940 sample and the 1952 sample in arithmetic reasoning
are given respectively in the fourth and fifth columns of Table I. These
differences are indicative approximately of the amount of growth in
the trait measured in the particular grade listed.*

The relative change in variability of the performance of children in
successive grades has also been difficult to determine, due to the lack
of an interval scale extending over the range of grades. The standard
deviation in terms of K-units of each grade in arithmetic reasoning for
the 1940 sample and the 1952 sample are given in columns six and
seven of Table I.
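
The grade statistics compared in Table I (means, differences between successive grade means, standard deviations, and skewness) can be computed as in the Python sketch below. The scores are made up for illustration and are not the Stanford Achievement Test data; the moment coefficient is used for skewness as an assumption, since the paper does not state which skewness measure was employed.

```python
from statistics import fmean, pstdev


def describe(scores):
    """Mean, standard deviation, and moment coefficient of skewness."""
    m = fmean(scores)
    s = pstdev(scores)
    skew = fmean([((x - m) / s) ** 3 for x in scores])
    return {"mean": m, "sd": s, "skew": skew}


def grade_summary(scores_by_grade):
    """Per-grade statistics plus differences between successive grade means."""
    grades = sorted(scores_by_grade)
    stats = {g: describe(scores_by_grade[g]) for g in grades}
    growth = {
        upper: stats[upper]["mean"] - stats[lower]["mean"]
        for lower, upper in zip(grades, grades[1:])
    }
    return stats, growth


# Hypothetical scaled scores for three grades (illustration only).
scores_by_grade = {
    2: [10, 12, 14],
    3: [14, 18, 22],
    4: [18, 25, 35],
}
stats, growth = grade_summary(scores_by_grade)
for grade in sorted(stats):
    print(grade, stats[grade], growth.get(grade))
```

As the accompanying footnote in the paper cautions, differences between the means of different groups of pupils approximate growth only to the extent that the successive grade groups are comparable.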

One of the major findings presented in a paper given at the 1948
Invitational Testing Conference was the consistent increase in
variability in two arithmetic traits from the second grade to the ninth in
contrast with two verbal traits in which the standard deviations were
nearly constant. The present study supports the previous finding
concerning increased variability from grade to grade in arithmetic
reasoning. The standard deviation in grade 2 is 3.3 K-units, while in grade
9 it has increased to 18.1 K-units. Thus one of the several implications
one can draw is that as children progress through the grades the
problems of the arithmetic teacher increase in that the groups
become more heterogeneous.

The skewness of each grade for each sample is given by columns
eight and nine. In the 1948 paper no consistent skewness trends
comparable to those observed for grade standard deviations were
evidenced. In the present situation there does appear to be an increase
in skewness from grade to grade with a single reversal between grades
2 and 3. These data coupled with the previously reported data (3)
would lead one to believe that the assumption of normality for every
grade distribution is not as tenable a hypothesis as the assumption
that the grade distributions are skewed.

* However, since the people in each grade were different from those in other
grades, these differences in grade means may be considered as growth only to the
extent that we are willing to consider, for example, the present third graders as
comparable to what the second graders will become a year hence. True growth
could be determined by measuring the same people with comparable instruments
in terms of K-units at different grade levels as they progress through school.

The data in Table II which were published in the Proceedings of
the 1948 Invitational Conference on Testing Problems have been
included to show the contrasting results obtained between arithmetic
functions (arithmetic reasoning) and a second function (paragraph
meaning) when measurements are made in terms of K-scores.

Considering the facts that different tests were used, and also samples
from different populations reflecting the lapse of a twelve year period
which included World War II with resulting dislocations of pupils
and teachers and many curriculum changes, it seems to the author
that discrepancies in the pattern of differences in grade means and
grade variabilities in the two sets of arithmetic reasoning data are
minor compared with the general pattern of agreement.

Role of Population in Scaling Individual Items

The problems involved in the scaling of individual test items are
similar to those of scaling test scores in that an item may be
considered as a test which represents a smaller sample of behavior than
the total test score. One of the most widely used scales in which
individual items were scaled is the Terman-Merrill scale for the
Stanford Binet (4). Items were located on this scale as a result of the
performance of well-selected age groups. In his recently developed latent
structure analysis Lazarsfeld (5) has presented scaling methods which
involve the assumption of a polynomial trace line for each item. The
responses of the sample of people to the item are used to determine
the parameters necessary to define the scale. In all cases the empirical
data which define the scale are dependent upon the reference
population used.

In some instances scaling based on total test score is preceded by a
scaling or partial scaling of items. In the Stanford Achievement Tests
difficulty indices for each item were computed for well-defined and
well-described grade groups. A test composed of these items was then
administered to a national sample of each grade group and various
types of scales based upon the total score were obtained.

General Considerations

In conclusion this paper has attempted to achieve two objectives:
(1) to review some of the more common scaling techniques and
emphasize the importance of the role of the reference population as
a background for the second paper which treats the topic of scaling
techniques which minimize the reference population and (2) to
illustrate that stable results may be obtained with different large
reference populations as shown by an empirical study on the comparability
of arithmetic reasoning K-scales based on two national samples of
elementary school children taken 12 years apart and obtained from
two distinct though similar instruments.

It is to be noted that not only in the argument of this paper but in
the development of K-scores (our major illustration) the reference
populations have assumed major and fundamental importance. The
acceptance of comparable scales utilizing different methods and/or
different populations is dependent upon empirical verification.

Situations where there is internal consistency within a number of
frames of reference but inconsistency of properties from one frame of
reference to another are not unique to scaling. There are excellent
analogies in the field of Geometry. The geometries of Euclid, Riemann,
and Lobachevsky, each one of which is based on a different postulate
about parallel lines, are consistent internally but have certain properties
which are inconsistent from one geometry to another. Each of these
geometries has its own value and utility as a logical model. The utility
of any particular one is determined by the appropriateness or adequacy
of the basic postulates to the problem at hand.

One of the objectives of the scientist is to bring together, reconcile
and synthesize as many theories and concepts as possible. In the
testing field we follow the usual pattern of establishing scales to fit a
particular need and then attempt to synthesize the properties of the
various scales designed for different purposes. On occasion we find
that for complete synthesis we either have to abandon a desirable
property or utilize an unacceptable relationship.

Although we continually strive for a single scale with the maximum
of desirable properties it would seem inadvisable to abandon useful
scales designed for a specific purpose merely because they are not
adequate for additional purposes for which they were not designed.

It should be emphasized that the adoption by a test user of any one
of the scales available does not exclude the use of any of the others.
In fact, the use of more than one type of scale leads to more adequate
interpretation of results in most situations.



TABLE I

K-Score Means, Standard Deviations and Skewnesses for Each Grade at End
of School Year on Stanford Achievement Test Form D Given in 1940 and
Form J Given in 1952

Arithmetic Reasoning

Selecting Appropriate Score Scales for Tests

LEDYARD R TUCKER

SCALES MINIMIZING THE IMPORTANCE OF
REFERENCE GROUPS

Scales for test scores have been the subject of many discussions
during the history of mental testing and a variety of procedures
attempting to establish scales have been developed. That score scales are still
a live topic attests both to their importance and to the absence of a
completely satisfactory solution. In light of the extensive literature and
the numerous schemes that have been tried for score scales, I view
with humility my attempts at contributions to the field. Rather than
attempting this morning to present a final, all-encompassing solution, I
am going to discuss four propositions which I hope will assist in
clarifying thinking about the subject of score scales and then indicate
the general nature of several possible procedures.

As indicated by note 1 on the sheet distributed to you I am limiting
my consideration to those situations in which each test yields one
numerical score for each examinee and this score is to be interpreted
by some person. In effect, I am excluding two classes of tests: (1) those
tests for which the scores are entered directly into prediction formulas,
and (2) those tests for which a number of scores are obtained over the
same set of items. When the scores are used directly in prediction
formulas, scaling problems for scores on the test are irrelevant to the
present discussion. In the case of multiple scores for the same test
performance, the situation is more complex than the situations I wish
to consider at this time. Some of the propositions and conclusions are
likely, however, to carry over to the more complex situation. These two
restrictions will not reduce the area of discussion greatly. A point
worthy of note is that we have not excluded subtests or sections in
a test battery when there is one score per subtest or section.

During this paper I consider it to be axiomatic that the score of a
person on a test is used to represent the performance of that individual
on the test. The first proposition emphasizes the information given by a
test score alone. In the general case, any particular score may arise
from any of several test performances. Consider, for example, an
eighty-item omnibus test composed of twenty items each of
vocabulary, reading comprehension, numerical computations, and figure
analogies. A score of sixty items right could be obtained by a number
of combinations of items answered correctly. One individual may have
answered correctly all items except the figure analogies while another
person may have answered correctly all items except the reading
comprehension. The score of sixty does not differentiate between these
two candidates.
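
The ambiguity in the eighty-item example can be counted directly. The Python sketch below enumerates every pattern of section scores (number right out of twenty in each of the four sections) that adds up to a total of sixty. The function name is hypothetical, and only section-level patterns are counted; distinguishing which individual items were answered correctly would give far more possibilities.

```python
from itertools import product


def section_profiles(total, sections=4, items_per_section=20):
    """All tuples of per-section right counts that sum to `total`."""
    per_section = range(items_per_section + 1)
    return [p for p in product(per_section, repeat=sections) if sum(p) == total]


profiles = section_profiles(60)
print(len(profiles))

# The two candidates described in the text, with sections ordered as
# vocabulary, reading comprehension, numerical computations, figure analogies.
missed_figure_analogies = (20, 20, 20, 0)
missed_reading = (20, 0, 20, 20)
print(missed_figure_analogies in profiles, missed_reading in profiles)
```

The count printed is 1771, so a single score of sixty is compatible with well over a thousand different section-score patterns, of which the two candidates in the text are only the most extreme.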

As a general principle I consider it to be obvious that the meanings
which may be given to a test score depend not only on interpretations
attached to the scores by various studies after the test is constructed
but also on the test itself. A test should be conceived in terms of the
scores that will result. The kinds of meanings that may be attached
to test scores are directly related to the nature of the behavior of
examinees the test provokes and to our methods of [word illegible].
Proposition I indicates the possibility of ambiguities among meanings
that may be attached to a single score. Differentiation among possible
meanings of a single score is impossible on the basis of the score alone.
The information given by this score is a complex of the possible
meanings.

A corollary to Proposition I might be stated that the information
transmitted by a particular score would be more definite the more
nearly equivalent were the possible meanings of the score. I will return
to this point in discussion of Propositions III and IV.

In Proposition II, consideration is given to the significance of
differences between two scores. For an example, consider a speeded
verbal reasoning test. Score differences in the lower range of scores
may be indicative of differences in a complex of verbal comprehension
and reasoning abilities. In the higher score range, score differences
may be associated to a greater extent with differences in speed of
reading. The proposition recognizes not only the possibilities of
changes in the nature of differences in test performances associated
with score differences at various score levels but also the possibility
of changes in the extent of differences in test performances associated
with uniform-sized score differences at the various score levels. In
some score ranges differences of some given amount between two
scores may have much less significance than the same-sized score
differences have in other score ranges.

I consider Propositions I and II to be true of all tests no matter how
the tests are constructed. Aside from pointing out the indivisible
character of any single score, these propositions indicate the existence
of a maximum of freedom as to possible meanings of scores and
differences between scores. This freedom as to meanings may
be so great that a complete loss of significance may occur. The
problem is one of being able to limit the possible meanings so as to obtain
such definite information as is desired. In an ideal test, as indicated
in Proposition III, the information given by each score would imply
a single interpretation. Uniform-sized differences between scores
would indicate a single kind and extent of differences between
corresponding test performances. Such a test would maximize the
definiteness of the information transmitted by the scores.

It is to be noted that the establishment of the concept of an ideal
test does not limit the nature of the score continuum. In the present
state of the art of testing we will probably be able to approach closer
to this ideal in some areas of skill and knowledge than in other areas.
No matter how poorly or well we can approximate an ideal test for a
characteristic we wish to test, the concept of an ideal test indicates a
worthwhile goal. The more nearly we can approach this goal of an
ideal test, the more definite will be the information given by scores
on the test.

Proposition IV emphasizes the point that a unitary continuum may
be achieved in a variety of ways. It is important, though, as indicated
in note 2, to establish the homogeneity of each continuum considered
for an ideal test. Unless the continuum is homogeneous in some sense,
an ideal test is impossible. Ambiguities of score interpretation are the
natural result of heterogeneity in meanings of test scores. In order to
obtain definiteness of information transmitted, it is imperative that
measures be taken to obtain homogeneity of the score continuum in
some desired sense.

I have listed in Proposition IV two senses in which the score
continuum may be homogeneous. The first sense depends on discovery of
homogeneous traits in the behavior of the examinees. This is the sense
basic to considerable work in psychological research. In contrast, the
second sense depends on a homogeneity of evaluations of behavior.
One might judge two distinct behaviors of individuals as being of
equal value in some field. The fact that the occurrence of these
behaviors is uncorrelated for a group of examinees would be irrelevant.
For example, consider a test such as "understanding of social
environment." An understanding of an economic principle such as the law of
supply and demand might be valued as highly as understanding of
current political events. The homogeneity exists, if at all, in the
opinions of the examiners and not necessarily in the behavior of the
examinees. It is important to distinguish these two senses in which
the score continuum may be homogeneous. Quite different modes of
procedure are appropriate in development of tests and score scales
for these two senses of score homogeneity.

Turning our attention now to the test development and score scaling
problem for each of the two senses in which the score continuum may
be homogeneous, consider the first sense, homogeneity of behavior.
A number of techniques, including correlational and factorial analyses,
have been developed to study the homogeneity of behavior. General
agreement exists that the individual differences within a homogeneous
domain of behavior will produce a hierarchical table of
intercorrelations for the population under consideration. A supplementary type
of study involves the difficulties of items as indicated by proportions
of groups of examinees who give particular responses. The population
for which the test is to be appropriate would be divided into select
groups on the basis of whatever available information is relevant to
performance on the test. A sample of people in each group would
be examined and the difficulties of the items would be obtained for
each sample. The sets of item difficulties should be systematically
related. In a present experiment, vocabulary test materials were
administered to students in the seventh and tenth grades of schools
located in each of four categories defined by high versus low
socio-economic districts in which the school is located and by location in
the north-east versus south-east regions of the United States. One
group, thus, includes schools located in low socio-economic districts
in the south-east and a second group includes schools in high
socio-economic districts in the south-east. Two similar groups were defined
for the north-east. Item difficulties will be determined for each of
these groups. Our question is whether the items will retain the same
rank order in difficulty when the item difficulties are based on such
different groups. Only in case that both such invariance of rank order
in item difficulty and a hierarchical table of correlations exist should
the domain defined by the items be considered as homogeneous.
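The check on invariance of rank order described above can be sketched in a few lines. This is only an illustration under stated assumptions: the group labels and proportions correct are hypothetical, and the Spearman rank correlation is one convenient index of rank-order agreement, not a statistic named in the talk.

```python
from itertools import combinations


def ranks(values):
    """Return ranks (1 = smallest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Product moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Rank correlation: the product moment correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical proportions correct for five items in four groups
# (region crossed with socio-economic level); not data from the experiment.
difficulties = {
    "NE-high": [0.92, 0.81, 0.65, 0.40, 0.22],
    "NE-low": [0.85, 0.70, 0.52, 0.31, 0.15],
    "SE-high": [0.90, 0.78, 0.60, 0.35, 0.20],
    "SE-low": [0.80, 0.66, 0.45, 0.50, 0.10],
}

# Rank-order agreement for every pair of groups.
agreement = {
    (a, b): spearman(difficulties[a], difficulties[b])
    for a, b in combinations(difficulties, 2)
}
```

A value of 1.0 for a pair of groups means the items keep exactly the same order of difficulty in both; lower values flag pairs of groups in which some items change places.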

Once homogeneity in a domain of behavior is established, a score
scale is to be determined so that each score represents a particular
point on the continuum. One might establish groups of items of equal
difficulty and arrange these item groups on a difficulty scale. The
preceding check on invariance of rank order of item difficulties
facilitates this step of grouping items. An examinee would be placed
on this scale by the group of items that he was able to perform at a
just satisfactory level. On groups of easier items, the examinee would
perform better than just satisfactorily and on groups of more difficult
items he would perform less than just satisfactorily. The examinee's
score would be defined in terms of the group of items he is able to
perform at a just satisfactory level. The test user would interpret this
score directly as the level of proficiency which these items represent.

An alternative procedure is to use a score on the test composed of
the items to establish sub-groups of examinees having approximately
equal ability in the function under consideration. All examinees whose
scores were within some narrow interval of scores would
constitute each of such sub-groups. The item difficulties would be
obtained for each of these sub-groups, and a check on invariance of the
rank order of item difficulties would be made across the several
sub-groups. In case the rank orders were stable, a scale of item difficulties
could be established. Each sub-group would be located on this scale
by those items with difficulties for that sub-group at some defined
level, say 70% correct. It is to be noted that this scale does not depend
on the number of examinees who are placed at any particular score
value; only the proportions of examinees giving the correct answers
to the items are used. The scale is independent of the shape of the
frequency distribution of scores. The homogeneity check for the
population guarantees independence of the scale from the particular
group of examinees used to establish the scale values.
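The alternative procedure can be sketched as follows. The response data, the score interval width of 2, and the choice of a single nearest item as the anchor are all assumptions made for illustration; the talk specifies only that each sub-group is located by items answered correctly by about 70% of that sub-group.

```python
def item_difficulties_by_band(responses, width):
    """Group examinees by total score interval and return, for each interval,
    the proportion of that sub-group answering each item correctly."""
    groups = {}
    for row in responses:
        groups.setdefault(sum(row) // width, []).append(row)
    return {
        band: [sum(column) / len(rows) for column in zip(*rows)]
        for band, rows in groups.items()
    }


def anchor_item(proportions, level=0.70):
    """Index of the item whose proportion correct is nearest the chosen level."""
    return min(range(len(proportions)), key=lambda i: abs(proportions[i] - level))


# Hypothetical right (1) / wrong (0) responses to five items ordered easy to hard.
responses = [
    [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0], [1, 1, 1, 0, 0], [1, 1, 0, 0, 0], [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 0], [1, 1, 1, 1, 1], [1, 1, 1, 0, 1], [1, 1, 1, 1, 0],
]

by_band = item_difficulties_by_band(responses, width=2)
placement = {band: anchor_item(p) for band, p in sorted(by_band.items())}
```

Only proportions within each sub-group enter the calculation, so changing how many examinees fall in each score interval leaves the placements unchanged as long as those proportions hold.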

Consider the second sense in which a score continuum might be
homogeneous: each score indicating placement on an evaluative scale.
Our methodology can now turn to investigations of the opinions of
people who will be considering the evaluations. Do the value opinions
form a homogeneous field? Or do separate "schools of thought" occur
which would alter the relative order of examinees in the evaluations
given by members of different schools of thought? Methodological
developments are in progress in the field of psychometric scaling
methods which show promise for application to this problem. Once
homogeneity of opinions is established for some defined domain and
a scale of values for behavior is developed, a test may be constructed
which will locate individuals on this scale of values. Points on the
scale would be defined by those behaviors which were valued at those
points. Such scales would depend directly on the group of judges
making the evaluations and would depend only indirectly on the behavior
of the population to be examined. A question could still remain as to
how opinions are influenced by observations of behavior.

When either type of scale indicated in the foregoing discussion has
been established, a survey of performances of a population would be
desirable. Comparative data between examinees would then be
produced. The advantage of use of ideal tests, as here conceived, and the
resulting scales in the survey operation is that definite information
about the distribution of the population as to levels of performance
would be obtained.

Note 1: Consideration is limited to those situations in which each
test yields one numerical score for each examinee and this score is to
be interpreted by some person.

Proposition I: Each test score by itself transmits the same
information for all examinees who receive that score. This information may
indicate some complex of qualitative and quantitative characteristics
of the test performances.

Proposition II: Considering for one test two raw scores differing by
one unit, the information transmitted indicates some complex of
qualitative and quantitative differences between the two test
performances with the kind and extent of these differences in test
performance possibly changing from one score level to another.

Proposition III: An ideal test may be conceived as one for which
the information transmitted by each of the possible scaled scores
represents a location on some unitary continuum so that uniform
differences between scaled scores correspond to uniform differences
between test performances for all score levels.

Proposition IV: The score continuum for an ideal test may be
homogeneous in any of a number of senses. Two basic senses are:

1. The scores indicate extent or degree of some trait which exhibits
homogeneity in the behavior of examinees.

2. The scores indicate placement on an evaluative scale for a
category of behavior considered in a unitary fashion by those
people making the evaluation.

Note 2: It is important to investigate the homogeneity existing for
the sense in which the scores are to form a continuum.

Note 3: For each of the two senses listed in Proposition IV,
experimental and analytic methods for test development and score scaling
may exist or be developed which do not depend on the relative
number of examinees who receive each particular score in a reference
group of examinees. Such methods would yield scaled scores
indicative of levels of performance on the test rather than a comparison of
relative positions of examinees in a group. The comparisons of
examinees may be performed in a separate, later step.






Selecting Appropriate Score Scales for Tests

JOHN C. FLANAGAN

DISCUSSION

In discussing the matter of how we were going to divide up the
discussion, Dr. Lindquist and I did not have a chance to review each other's
remarks, partly because they were not entirely formulated, there was
some delay in receiving the remarks, and partly because we thought
it might interfere with the spontaneity of the discussion.

We agreed that we would choose different topics. He is to talk about
the basic considerations involved in this problem and I am discussing
the fundamental principles.

One of the fundamental principles we have to deal with in this
problem of scaling of test scores is that we do not have any ideal
scores such as the ones that Ledyard Tucker has been talking about.
In practically all cases of tests with which I am familiar you have
much more information if you know exactly what the response of each
man to each item is than if you have a simple summary score. In other
words, we must remember we do not have ideal tests. Presumably Dr.
Tucker is talking about an ideal mathematical model which we will
never have in practice. We may approximate it in many situations but
for practical purposes it is just something to think about, not
something that we will be able to use.

For example, if we take a test of history, it is ridiculous to assume
that one teacher's group will not learn more about some particular
types of items concerning Betsy Ross or the Civil War or some other
happening than some other teacher's class. Therefore we will never
have this perfect homogeneity which is necessary in most types of
achievement tests. The one place where we might get something
approaching homogeneity is in some sort of power scale. We have a
power scale in situations in which the items can be so arranged that
if you can do a specific item at one point on the scale, you can obviously
do all those earlier on the scale. In such scales we do not have any
specifics. Training affects all items equally. Nothing that happened
yesterday morning or that you read in the newspaper can affect one
item and not the others. When all these conditions are fulfilled we
have a homogeneous scale.

It seems unlikely that an educational test of this type will be found
and therefore it appears that this is a model to think about and not
something for practical use. Certainly such a test will not be found
in the field of vocabulary. Obviously it is silly to think about a perfect
order of difficulty of vocabulary items. They are going to differ in
difficulty according to what you have looked up in the dictionary recently, or
what somebody else has popularized. Similarly for mechanical
principles items. Specific experience is going to prevent you from ever
having one of those homogeneous behavior scales.

The other type of ideal scale suggested by Dr. Tucker involving
homogeneity of opinion evaluations seems even more remote from
any possibility of realization. We usually use as our score the number
of right answers, or some function of this, recognizing that this is an
over-simplification. We would theoretically be better off to weight some
items differently. Each of these items is not of equal value and equal
importance; it cannot be exactly as important as each of the other
items in determining what we are trying to determine. It must be
recognized that having the exact pattern of how each person performed
on each item contains more information than we can get out of any
single score.

Assuming that we have a simplified abstraction of this performance
in terms of a score, what is the fundamental principle for determining
how we should express these scores? It seems to me that the
fundamental principle governing our behavior here is utility: what scores
are going to be most useful to us. This depends on purposes. I would
think that there certainly are, as Dr. Tucker said, some purposes for
which certain test scores will be more valuable than others.

For many types of analysis and study of test scores such as for
prediction problems, we are going to want to use the scores to
calculate product moment correlation coefficients. It is very desirable
in getting an estimate of the correlation in the particular population
to have the same shape of distribution for the scores of the two
variables. In other words, if you have a skewed distribution in one
variable, you want a distribution with the same skewness in the other
variable. This will yield the maximum possible correlation between
these two variables. It seems to me, therefore, for a lot of purposes
if we can make the distributions normal, we have more likelihood of
obtaining consistent results than if we skew one of them one way and

On the other hand, if we normalize them from one population and
find that they are radically skewed for others, this perhaps would
suggest that some variation in the basic type of scale should be
introduced.

I should like to review the fundamental principles in establishing
a set of scaled scores. The first of these is the reference point.
Assuming that you are going to modify your raw scores, there is no use
changing by adding 10 or 15 or subtracting something from your
raw scores or making some other change, unless you are going to get
some meaning into this fundamental point of reference.

One of the most useful scores that we have had has been the I.Q.,
because of the simple meaning of a score of 100. The I.Q. score has
other difficulties, as pointed out in meetings here in the past couple of
days, but the fundamental point of reference of 100 has been extremely
useful. We tried to capitalize on this type of thing in getting a
fundamental point of reference for scaled scores. The 50-point is very
similar to 100 I.Q. in its fundamental meaning. Similarly, in
establishing the stanine scale, we have established 5 as a point of reference.

The second problem is the size of unit. It is important that this size
of unit have some meaning. Some simple, easily remembered meaning
is desirable. Making the standard deviation equal to 10 or 2 or some
simple multiple of this sort is useful because many of us dealing with
such scores remember the unit normal distribution and therefore
know about how many scores can be expected to be as far as two
standard deviations from the mean. This tells us immediately
something about the scores.
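The two choices just discussed, a meaningful reference point and a meaningful unit, can be illustrated with a simple linear conversion. This is a sketch under assumptions: the reference group scores are hypothetical, and a linear transformation is used for brevity, whereas the scaled scores described in the talk may have been normalized rather than linearly derived.

```python
from statistics import mean, pstdev


def linear_scaled_scores(raw, reference, center=50.0, unit=10.0):
    """Convert raw scores so that the reference group's mean maps to `center`
    and one reference-group standard deviation maps to `unit` scale points."""
    m = mean(reference)
    s = pstdev(reference)
    return [center + unit * (x - m) / s for x in raw]


# Hypothetical raw scores of a reference group.
reference = [12, 15, 18, 21, 24]

# A raw score at the reference mean and one well above it.
scaled = linear_scaled_scores([18, 24], reference)
```

With a center of 50 and a unit of 10, a scaled score of 70 is immediately readable as two standard deviations above the reference group's mean; passing `center=5.0, unit=2.0` gives the corresponding stanine-style metric before rounding.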

The other question involved in size of unit is coarseness. In the
scaled scores we used a standard deviation of 10; in stanines, we have
a standard deviation of 2. There are some scales using standard
deviations of 100. These are 3 digit scores as compared with the one
digit that have been used frequently. I doubt that there are very many
tests which justify three digit scores because of their accuracy of
measurement and the uses to which they will be put. In the military
services with a day and a half of testing and 10 or more scores going
into each composite, we still reported the composite on a 9 point
scale. Certainly to put a 15 minute or a half hour test on a 3 digit 1000
point scale seems a little ridiculous.
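A coarse 9 point scale of the stanine type can be sketched as below. The cumulative percentage boundaries are the conventional stanine allocation (4, 7, 12, 17, 20, 17, 12, 7 and 4 per cent of a normal distribution); the reference group and the handling of tied scores are assumptions made for the illustration.

```python
from bisect import bisect_right

# Cumulative percentage upper bounds for stanines 1 through 8.
STANINE_CUTS = [4, 11, 23, 40, 60, 77, 89, 96]


def percentile_rank(score, reference):
    """Per cent of the reference group below the score, counting half of ties."""
    below = sum(1 for r in reference if r < score)
    ties = sum(1 for r in reference if r == score)
    return 100.0 * (below + 0.5 * ties) / len(reference)


def stanine(score, reference):
    """Place a raw score on the 9 point scale by its percentile rank."""
    return bisect_right(STANINE_CUTS, percentile_rank(score, reference)) + 1


# Hypothetical reference group: raw scores 1 through 100, one examinee each.
reference = list(range(1, 101))

middle, lowest, highest = (stanine(s, reference) for s in (50, 1, 100))
```

Because placement goes through the percentile rank, the resulting scores are normalized with respect to the reference group: a stanine of 5 marks the middle fifth of that group whatever the shape of the raw score distribution.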

In some new work which I am doing on aptitude tests, I have
decided that a 27 point scale in which you used the 9 point scale with
a plus, minus and zero would be a little preferable to the 9 point scale.
For most purposes people are not going to pay much attention to the
plus, minus and zero, but for some purposes, especially where you
are dealing with a group, all of whom are at the high end of the scale,
this might have some value in breaking ties, although reliability of
the scores may not be sufficient to make this breaking of ties of very
much practical advantage to the user.

The question of equality of units is the last of the fundamental
principles to be discussed. The problem of utility is the primary factor
here. We want to get distributions which have as similar shape to
other distributions as possible and with as much constancy of shape of
distribution as possible from one population to another. I think this
involves a certain amount of trial and error. If we find that one
particular shape, say the rectangular distribution as from percentiles
in the sixth grade, would provide basic units which would distribute
results from the fifth and seventh grades rectangularly also and show
similar consistency from one region of the country to another, I would
certainly say we ought to use units yielding rectangular distributions.
I think, as most of us have experienced, this does not happen. We are
much more likely to get consistency if we have a normalized set of
scores as the basic units.

As to whether or not some element of skewness is important for
some situations I think we do not have adequate information yet.
Certainly this field should continue to be explored along these lines
Dr. Gardner has been following.

One other point that should be made is that in setting up these
scales we should distinguish between ideal properties of scaled scores
and practical factors of convenience and cost. In planning for the
battery of aptitude tests which I am publishing shortly, a
comprehensive review of all the circumstances suggested the best thing to do
would be to have the stanine of 5 represent a random sample of
18-year-olds in the United States population. Having thought about this
and having explored the possibilities of getting this done through
draft boards and similar means, I finally rejected it. I wish to make it
very clear that it still seems the best thing to do but not within my
resources. As a practical expedient we are adopting as the basic
reference group Pittsburgh public high school seniors.

Certainly such a reference group is homogeneous and has certain
advantages. However, we should make it very clear in our discussions
whether we are talking about what we ideally think we ought to have
done. I still think that a random sample of 18-year-olds would be
better. That just did not seem to be feasible to me at the present time.


I think, in closing, that we should try to keep in mind that any
system of developing a series of scores should be for practical utility
and it should be demonstrably more useful to people who are going
to use and interpret the scores than the other procedures or raw scores
that are available to them.

DISCUSSION

Thank you, John, for sticking so close to the agreement. This has
worked out pretty well. I can see almost no overlap or similarity
between John's fundamental principles and my basic considerations.

What I propose to do is to present to you a number of what I have
termed basic considerations in scaling educational achievement tests.
These are designed to support a particular point of view with reference
to the whole problem of scaling. You might say that their purpose
is to create an attitude or a change in attitude, if possible, toward
the scaling problem.

I think it might be well to start off by specifying some of the
purposes of scaling. Perhaps this is so obvious it hardly needs to be said,
but I would suggest that scaled scores are needed for three general
purposes: first, to facilitate comparisons between performances on
different tests and thereby to provide a basis for computing properly
weighted composites of scores on different tests; second, to facilitate
comparisons between differences in performance at different levels
on the same or different tests; and, third, to facilitate the presentation
of normative data, either by incorporating some of the normative data
in the scaled scores themselves, or by making it easier to organize
and present the tables of norms. It is much easier to prepare manuals
or tables of norms if all can be referred to a single reference scale
than if one has to refer each to a raw scale for each individual test.
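The first purpose, properly weighted composites, can be illustrated with a small sketch. The test names, scores and weights are hypothetical, and standard scores stand in for whatever common scale is adopted; the point is only that raw scores with unequal spreads carry unintended weights when summed directly.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Express scores as deviations from the mean in standard deviation units."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]


def weighted_composite(tests, weights):
    """tests: {name: [raw score per examinee]}; weights: {name: intended weight}."""
    z = {name: standardize(scores) for name, scores in tests.items()}
    n = len(next(iter(tests.values())))
    return [sum(weights[name] * z[name][i] for name in tests) for i in range(n)]


# Hypothetical raw scores for three examinees on two tests.
tests = {
    "arithmetic": [10, 20, 30],  # wide raw spread
    "spelling": [6, 4, 5],       # narrow raw spread
}

raw_sum = [a + s for a, s in zip(tests["arithmetic"], tests["spelling"])]
composite = weighted_composite(tests, {"arithmetic": 1.0, "spelling": 1.0})
```

In the raw sum the first examinee ranks below the second because arithmetic dominates; once both tests are on a common scale and weighted equally, the first examinee's strong spelling offsets the weak arithmetic and the order reverses.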

Of these three purposes, perhaps the most important is the first,
although I am not interested now in arguing the relative importance
of these purposes.

Perhaps before going further, I ought to say also that my remarks
will be pretty much restricted to applications to educational
achievement tests, and will not include psychological tests of various types,
aptitude tests, interest inventories, and that sort of thing. I should like
to begin, then, by another obvious statement, defining what I mean
by an educational achievement test. An educational achievement test
is one designed to reveal differences among the examinees in the
extent to which they have attained a particular educational objective,
or set of objectives. With reference to that definition, a good
educational test, it seems to me, must by itself constitute a complete and
adequate definition of the objective with which it is concerned. That
is because, as it works out in practice, the test itself so frequently
becomes the end of instruction. Teachers and principals make it their
business to improve the average score on many of these tests, and
unless the things that they must do in order to achieve higher score
averages are precisely the things that we would like to have them do
(in other words, unless the things measured by the test are precisely
the educational objectives to be achieved) we are going to get into
serious difficulty.

A good educational achievement test, then, must itself define the
objective measured. This means that the method of scaling an
educational achievement test should not be permitted to determine the
content of the test or to alter the definition of objectives implied in
the test. From the point of view of the tester, the definition of the
objective is sacrosanct; he has no business monkeying with that
definition. The objective is handed down to him by those agents of society
who are responsible for decisions concerning educational objectives,
and what the test constructor must do is to attempt to incorporate
that definition as clearly and as exactly as possible in the examination
that he builds.

Now, the statistical properties of educational achievement tests and
of test items are to a very large degree a function of arbitrary features
of the school curriculum and of variable features of the examinees.
Dr. Gardner gave some very convincing evidence of that. I should like
to add just a little bit to it only by way of samples of the kind of thing
I mean. I have the data on a few arithmetic test items that were tried
out on a very large population of elementary schools in the Iowa Basic
Skills Testing Programs. For instance, one of these items reads,
"multiply 500 by 8." The difficulty of that item in the third grade was 4 per
cent (that is, 4 per cent answered that item correctly); in the fourth
grade, S5; in the fifth grade, 84, and from there on it maintained that
high level.

Another item reads, "divide 84 by 2." At the beginning of the third
grade this item has a difficulty of 13 per cent. By the end of the second
semester of the third grade it has a difficulty of 57 per cent. By the
beginning of the first semester of the fourth grade it has a difficulty of
83 per cent.

One more item, "add % and," has a difficulty below 5 per
cent in each of grades 3 and 4; in grade 5 it has a difficulty of S per
cent; in the sixth grade the difficulty jumps from S to 78, and in the
next grade it is 86.

Now, what accounts for these abrupt changes in the difficulty of the
item? Is it attributable to some natural change in the nervous or
physiological maturity of the child? Obviously not. It depends only
upon certain arbitrary decisions that have been made in the
organization of the school curriculum. The schools have decided to teach this
item in this grade, and that item in another. Tomorrow they might
change their minds. There is nothing magic about these abrupt
changes.

Those changes in difficulty occur not only from grade to grade, but
even from school to school within the same grade. The item, "divide
84 by 2," was tried out in 12 different schools the same time of the year
and under the same conditions. These schools together yielded a
sample of about 600 pupils. In School A, the difficulty of the item was
39 per cent. In School B, it was 6 per cent. In School C, it was 82 per
cent. In School D it was 100 per cent. From School B to School D,
the range in difficulty of this one item was from 6 to 100 per cent.

The item, "subtract % from %," has a zero difficulty in School A; an
85 per cent difficulty in School B.

The item, "multiply 0.24 by 524," has a zero difficulty in School A,
and 82 per cent in School B, and a 0 per cent difficulty in School D.
Clearly, this is because these schools could not agree upon the point at
which this item was to be presented in the curriculum, and so in one
school the item was extremely difficult and in another school it was
very easy.

The decisions that characterize the differences between these
schools could easily characterize entire populations. All of the schools
in one population might decide to teach one item in one grade level
and all of them in another population to teach the same item at another
grade level.

All methods of scaling educational achievement tests now being
considered are based upon the statistical properties of the test or of
the individual items constituting the test with reference to a particular
population of examinees, which was Dr. Gardner's main point, and
with which I would agree absolutely. That is, all scales are derived
from normative data.

Now, raw scores on some educational achievement tests are
meaningful in themselves in terms of the content of the test. For example,
you might build up a test of the 100 basic addition combinations in
arithmetic, and you might find that a particular student answers
correctly 60 out of those 100 items. That obviously means something
without regard to anyone else's performance on that test. This is what we
might, for purposes of this discussion, call an absolute meaning, or a
fundamental meaning. But groups of items arranged with reference to
such meanings do not constitute scales. You cannot compare 70 out of
100 basic addition facts with S out of 20 rules of grammar, or with a
certain number out of a possible number of vocabulary items in
French, and so on. Even though you have grouped the items in this
respect, you must still attach numbers to those groups that will make
the performance comparable from group to group. In other words,
the scaling job still has to be done after this grouping has taken place.

Any meaning that a scaled score has, in addition to that contained in
the raw score, it has because of the normative data incorporated in
the score, and that meaning applies strictly only to the particular
reference population involved in the scaling process. In other words,
no scaled score has any fundamental meaning attributable to the scale
itself. Whatever meaning it has, in addition to the kind of meaning I
just discussed, it has because of the normative data incorporated in
the score.

It is impossible to incorporate in any single scale normative data for
more than one reference population. However, in order to interpret
satisfactorily the scores on most educational achievement tests, one
must refer to data for a large number of different reference
populations. The kind of tests we are talking about, elementary school
achievement tests, make that very obvious. You have a different
distribution of scores on the test for every one of the grades, say, from
the third to the eighth. You have another different distribution of
scores if you throw all of those grade populations together, which is
the population that we use in effect when we establish a grade
equivalent scale. Again, you may wish to interpret school averages,
and they must be interpreted with reference to distributions of school
averages, not with reference to distributions of pupil scores, and there
again you have a different distribution for every grade, and you have
a different distribution for all grades thrown together.

Furthermore, as has been suggested, you will get different
distributions within the same grade for the same test from one
geographic population to another. There are marked differences in
Iowa between the distributions of scores for one-room rural schools
and the distributions of scores for schools in communities of more
than 25,000 in population, on exactly the same test. It is thus
possible to identify a very large number of reference populations, all
of which must be used for satisfactory interpretation of the scores on
any of these tests.

Accordingly, whatever we choose to regard as the basic scale for a
test, we must always set up alongside this scale a large number of
other scales, or if you prefer to call them that, tables of norms, each
of which is to be employed for a different purpose. In most, if not
nearly all, practical situations it is difficult to determine which of
these purposes is most important, or which scale is really the basic
scale. Indeed, since the relationships among the various scales can
always be determined, that is, with reference to a particular
population it can always be determined, or since any scale can always
be expressed in terms of any other scale for that population, there
seems to be very little point in trying to determine which scale is
basic. However, it is usually desirable for reasons of convenience to
select one scale to be employed as a reference scale. This is the scale
to which the raw scores are often immediately converted in the scoring
process, and in terms of which the only original record of the scores
is made.

With reference to this reference scale, from at least one point of
view (perhaps I should say before presenting this point of view it is
not my own, it is not one by which I would abide in practice, but it
certainly is a point of view that deserves consideration) the best type
of reference scale for a test is one that is divorced as much as is
possible from any normative meaning. I repeat, from one point of view,
the best kind of reference scale is one completely devoid of normative
meaning. For example, the scaled scores along the reference scale for
a test might be simply the corresponding raw score expressed as a per
cent of the possible score on the test. That might have some absolute
meaning of the kind I discussed earlier, but it would have no normative
meaning. It would not be a good reference scale, I want to say at once,
because there are too many connotations, undesirable connotations,
attached to the per cent score on the test. But the use of a scale that
is divorced from normative meanings has the very distinct advantage
that if the norms change after the scale has been established (and that
does frequently happen) then there is no need to abandon the scale on
that account, or to rescale the test. Instead, all one need do in that
case is to leave the reference scale as it was before, because it does
not depend upon normative meanings, and make whatever changes in the
normative scales associated with it happen to be appropriate, and those
changes may affect only some of the normative scales.
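The separation described above, one fixed reference scale with any number of replaceable norms tables beside it, can be sketched in a few lines of Python. This is only an illustration: the population names, the norm samples, and the mid-rank percentile convention are assumptions made for the example, not anything given in the talk. The reference scale used is the per cent of possible score that the speaker mentions as an example.

```python
from bisect import bisect_left, bisect_right


def reference_score(raw, max_raw):
    """Reference-scale value: raw score as a per cent of the possible score."""
    return 100.0 * raw / max_raw


def percentile_rank(score, norm_sample):
    """Mid-rank percentile of `score` within one reference population."""
    ordered = sorted(norm_sample)
    below = bisect_left(ordered, score)
    at_or_below = bisect_right(ordered, score)
    return 100.0 * (below + at_or_below) / (2 * len(ordered))


# Hypothetical norms tables (reference-scale scores), one per population.
# Replacing or adding a table never requires rescaling the test itself.
norms = {
    "grade 6 pupils, rural": [30, 40, 45, 50, 55, 60, 70],
    "grade 6 pupils, city": [40, 50, 55, 60, 65, 70, 80],
}

score = reference_score(raw=33, max_raw=60)
ranks = {name: percentile_rank(score, sample) for name, sample in norms.items()}

for name, rank in ranks.items():
    print(f"{name}: reference score {score:.1f}, percentile rank {rank:.1f}")
```

The same reference score receives a different normative meaning in each population, which is the reason several norms tables are needed alongside the one scale.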

Let me now attempt to summarize what I have been trying to say.
For educational achievement tests (I mean specifically tests like the
Stanford and Metropolitan Achievement test batteries, the Iowa Tests
of Basic Skills, and the projected ETS tests of basic educational
objectives) the amount of meaning that can be built into any single
reference scale will constitute only a very small part of the total
amount of meaning to be derived by all of the test users from those
test results. To a very considerable extent, therefore, any so-called
basic scale is primarily a device for facilitating the presentation of
other scales. The scale value of a given performance on any of these
other scales will be exactly the same for a given reference population
regardless of the nature of the reference scale used, because there is
always a monotonic relationship among these scales. Accordingly, the
problem of what size scale, or what kind of a reference scale, is to be
employed with an educational achievement test is a problem, in my
opinion, of relatively minor importance. The major problem is what
scales should be employed with educational achievement tests, scales
in the plural rather than scale in the singular. That is, what kinds of
norms should be provided with the test, and how and for what purposes
each should be interpreted.
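The claim that a normative value does not depend on the choice of reference scale, because the scales are monotonically related, can be checked with a small Python sketch. The raw scores and both reference scales below are hypothetical, and the transformations are assumed to be strictly increasing.

```python
import math


def percentile_rank(score, sample):
    """Mid-rank percentile of `score` within `sample`."""
    below = sum(1 for s in sample if s < score)
    ties = sum(1 for s in sample if s == score)
    return 100.0 * (below + 0.5 * ties) / len(sample)


def percent_scale(raw, max_raw=50):
    """One possible reference scale: per cent of the possible score."""
    return 100.0 * raw / max_raw


def stretched_scale(raw):
    """Another, arbitrary but strictly increasing, reference scale."""
    return 200.0 + 10.0 * math.sqrt(raw)


raw_scores = [12, 18, 25, 25, 31, 40, 47]  # hypothetical reference population
pupil_raw = 31

rank_raw = percentile_rank(pupil_raw, raw_scores)
rank_percent = percentile_rank(
    percent_scale(pupil_raw), [percent_scale(r) for r in raw_scores]
)
rank_stretched = percentile_rank(
    stretched_scale(pupil_raw), [stretched_scale(r) for r in raw_scores]
)

print(rank_raw, rank_percent, rank_stretched)
```

All three ranks agree, since a strictly increasing transformation preserves the order of scores and therefore every rank-based norm.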

I should like to conclude with just one or two more specific
comments with reference to the proposals or suggestions that have
already been made. So far as these so-called basic considerations are
concerned, I would certainly have no objection to the use of a scale
such as Dr. Gardner suggests. I would only want to point out you might
have to establish a scale of that kind for each of a very large number
of different reference populations, because a scale of that kind
established for one reference population will not serve all of the
purposes that have to be served, or even a very large proportion of
them.

I should like to point out, also, that comparisons were made by Dr.
Gardner of score distributions on the K scale for different grades.
Now, with the K scale certain assumptions underlie those comparisons.
The first is that there is a common growth curve for all of the tests
involved, but what kind of a growth curve you have for a particular
test depends, as pointed out a moment ago, upon what arbitrary
decisions have been made by the schools with regard to the
presentation of those particular skills or abilities in the curriculum.
The K scale also assumes uniform within-grade variability, but what the
variability within a grade for a particular test is depends upon these
arbitrary decisions. We know, for example, that in the sixth grade the
variability within grades on an arithmetic fundamentals test, in terms
of differences between successive grade medians, is very much smaller
than it is in reading. In the case of the Iowa Tests of Basic Skills,
the seventh grade norm lies at the 95th percentile of the sixth grade
distribution on the arithmetic fundamentals test, but for reading
comprehension, the seventh grade median lies at the 65th percentile in
the sixth grade distribution, and for a test of basic social concepts
which we recently have tried out in Iowa, the seventh grade median
lies at about the 55th percentile in the sixth grade distribution.
This is a terrific difference in overlap from grade to grade, a
terrific difference in relative variability from grade to grade, which,
it seems to me, is obviously attributable to differences in curriculum
decisions with reference to grade placement.
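The overlap measure used in this comparison, the percentile at which the upper grade's median falls in the lower grade's distribution, can be sketched as follows. The two sets of scores are invented for illustration and are not the Iowa data; the mid-rank percentile convention is likewise an assumption.

```python
from statistics import median


def percentile_rank(score, sample):
    """Mid-rank percentile of `score` within `sample`."""
    below = sum(1 for s in sample if s < score)
    ties = sum(1 for s in sample if s == score)
    return 100.0 * (below + 0.5 * ties) / len(sample)


def overlap(lower_grade, upper_grade):
    """Percentile of the upper grade's median in the lower grade's scores."""
    return percentile_rank(median(upper_grade), lower_grade)


# Hypothetical test A: little spread within a grade relative to grade growth.
sixth_a = [50, 52, 53, 54, 55, 56, 57, 58, 60, 62]
seventh_a = [58, 60, 61, 62, 63, 64, 66]

# Hypothetical test B: wide spread within a grade relative to grade growth.
sixth_b = [30, 38, 45, 50, 55, 60, 65, 72, 80, 90]
seventh_b = [35, 45, 55, 62, 70, 80, 95]

overlap_a = overlap(sixth_a, seventh_a)
overlap_b = overlap(sixth_b, seventh_b)
print(overlap_a, overlap_b)
```

A high value means few sixth graders reach the seventh grade median (small within-grade variability relative to growth); a value near 50 means the two grade distributions almost coincide.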

Now, finally, with regard to Dr. Tucker's proposal, I will say very
much the same thing that Dr. Flanagan has said. If Dr. Tucker does
succeed in finding a number of items that all happen to have the same
rank order of difficulty for a particular group of reference
populations, he will find those items only because it happens to be
true of the school curriculum that the schools have attained some
agreement on those particular items, and that will be more or less an
accident rather than anything fundamentally descriptive of the child.

Furthermore, if he does find a number of items of that character,
those items will not define any of the educational objectives which
have been handed down by such committees as the Mid-Century
Committee on Educational Objectives. It will be a matter of accident
what those items define in the way of an educational objective. So
I would say that while I would be perfectly willing to accept the kind
of scale that Dr. Tucker suggests, I would not, because of the second
principle that I suggested, that a good achievement test must by itself
constitute an adequate definition of the objective with which it is
concerned, on that account accept the kind of scale that Dr. Tucker
suggests.

THE MEASUREMENT OF HUMAN MOTIVATION: AN EXPERIMENTAL APPROACH

What I have to say this morning will be somewhat of a change of
pace from what you have been listening to, since I approach the
measurement problem from an experimental point of view rather than
from the traditional one.

I should like to review first the different ways in which psychologists
have attempted to measure motivation in the past. In the first place,
the simplest way, apparently, to measure human motivation is to ask
a subject how motivated he is for something or other. The psychologist
always starts with the simplest approach: just ask the subject. Of
course, we psychologists did this and we did it elaborately. We did it
by setting up self-rating series; we drew graphs to show the normal
distribution, and we urged the subject to follow the normal
distribution or put his check marks in some kind of a pattern, but
fundamentally the method involves simply asking the subject how
motivated he is.

I do not need to tell you, I think, what the difficulties with this
approach are. One of the major ones is, of course, that subjects have
different subjective standards, and if you ask them how motivated they
are for achievement each one will have a different idea of what intense
achievement motivation is, so that when you try to compare their
self-judgments, you get "hash."

The second approach is, if you can't ask the subject, then ask
somebody else how motivated he is. Of course, you choose somebody that
knows him fairly well, presumably, like a teacher, and you get the
teacher to rate the pupil on how motivated the pupil is. If you don't
think teachers are capable of doing this correctly or validly, you can
ask a clinical psychologist, who may study the person for several
weeks, or even several years if he is a psychoanalyst, and then get him
to make a rating as to how motivated the person is for achievement.
And I suppose if the clinical psychologist does not feel he is capable
of doing it, you can always ask a psychiatrist, whose judgment may be
even better.

But again I do not need to tell you the difficulties with this
methodological approach. One difficulty is circumvented: if you ask
the same judge to judge several people, he will more or less keep his
standards the same, so that you do not have the problem of shifting
norms quite as badly as you do if you ask the subjects to rate
themselves. But other difficulties arise. For example, it is not
exactly clear what the judge is judging, just what his definition of
the motive is. His definition isn't always communicable.

The chief objection to this approach is partly practical and partly
theoretical. Practically, I do not think that judgments of motivation
have proven extremely fruitful in predicting performance, and I
suspect that the reason is that the judgments are not pure enough. Too
many factors are taken into account in a clinical judgment, so that it
is difficult to tease out precise relationships with performance. This
difficulty ties in with the theoretical objection that I have, an
objection which can be highlighted by comparing the process to asking
a group of physicists to measure temperature by pooling their
judgments as to how hot it is or how cold it is. You can undoubtedly
get a reliable estimate, that is, you can get to agreement this way,
but it isn't exactly measurement. I have always been interested in
pushing our measurement of motivation more in the objective direction.

A third way of measuring motivation, at least achievement motivation,
with which I am chiefly concerned here this morning, is to look at
behavior, this darling of American psychology, behavior. It is what
the person does that counts. It is not what he thinks, feels, or
believes; it is what he does, and if he works hard, he has a high
achievement motive. Why not use that as a simple method of measuring
motivation: how hard does the pupil work? Well, again there are
difficulties here. Theoretical psychologists tell us that performance
is determined by more factors than just motivation, so that if you use
performance as an index of motivation, you get a lot of other things
mixed in there, too, such as past learning, intelligence, etc. Another
difficulty, even more serious for motivational theory, is this: a
person may work hard for several different reasons. He may work hard
because he is anxious or worried, not because he has a high
achievement motive. So performance can never prove a very adequate
method of measuring achievement motivation, per se.

... motivation which we have adopted, I think, more or less by
accident. I must tell you that a lot of things become clearer in the
cold clear light of hindsight than they are at the time, and that five
years ago when we began our research on the achievement motive, we did
not go through this step-by-step analysis of other measurement
methods, reject them, and then choose the one I am going to describe
to you. It happened much more accidentally than that, but now that we
have done it this way, our line of reasoning looks sensible to me.

What we did was to do content analysis of imaginative behavior or
fantasy. (1) Why did we choose fantasy or imaginative behavior? I use
"fantasy" for my clinical friends, and "imaginative behavior" as a
kind of bow to my Yale background. They mean the same thing.

I think our primary reason for choosing fantasy was that it has so
obviously worked. It has a history in the clinical field in which free
association and dream analysis in the hands of the psychoanalysts have
led to very fruitful and productive motivational theories. If you
stand off and look at the whole psychoanalytic tradition, beginning
with Freud, you can oversimplify it by saying that it really deals
primarily with motivation. Freud was not particularly interested in
learning or problem-solving in the great American tradition; he was
much more interested in motivation, and I think the reason is partly
methodological. What Freud studied was not problem-solving, not
learning, not how you get a pencil through a maze; instead, he studied
fantasy (free association) and because he studied this type of
behavior, I think he arrived at a motivational kind of theory.

So we took this as a lead, and, of course, we had the support of the
long Murray tradition at Harvard, which had shown some very fruitful
motivational analyses based on fantasy.

One might discuss here why fantasy should provide a good index of
motivation, but I will not try to do it. It would lead me too far
afield. The sort of argument we make is that fantasy is not influenced
much by factual statements or knowledge. It is not much influenced by
values, what a person ought to say in a test, as he doesn't have a
very clear idea of what he ought to write in a Thematic Apperception
Test. So, the reasoning runs, the only thing that is left to determine
his responses is motivation, by a process of exclusion. At any rate, we
use fantasy for whatever reason.

(2) Why content analysis? Well, content analysis, for my money, is
just a more systematic way of making a judgment. The usual way of
treating a Thematic Apperception Test record is to have a judge read
the whole record and synthesize his impression of it into a rating.
Well, from my comments earlier about the complexities of such ratings
and what goes into them, you can see that I would want to move in the
direction of a more objective nose-counting operation. A good analogy
which I have often had clearly in mind is the process of making blood
counts such as a medical technician makes: you get a sample of blood
under standard conditions, you put it under a microscope with a grid
over it, you count the number of red corpuscles and white corpuscles,
etc.

Our approach is somewhat similar: you get a series of thought
samples, or samples of imaginative behavior, and then develop a
categorizing or classifying system. Then you count the number of times
that a certain imaginative element appears. The operation is a simple
yes-no dichotomous type of thing, presence or absence: the imagery is
either there or isn't there, like the white corpuscle.
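The counting operation described above can be pictured with a toy Python sketch. It is only a stand-in: in the procedure the speaker describes, trained human scorers judge whether an imaginative element is present, whereas here a hypothetical list of cue words plays that role. The category names and cue words are invented for the example.

```python
# Hypothetical categories and cue words. In the talk the presence/absence
# judgment is made by trained scorers, not by keyword matching.
CATEGORIES = {
    "achievement_imagery": {"win", "best", "succeed", "excel"},
    "obstacle": {"fail", "difficult", "blocked"},
}


def score_story(story):
    """Return 1 (present) or 0 (absent) for each category in one story."""
    cleaned = story.lower().replace(".", " ").replace(",", " ")
    words = set(cleaned.split())
    return {name: int(bool(words & cues)) for name, cues in CATEGORIES.items()}


def tally(stories):
    """Count the number of stories in which each category is present."""
    totals = {name: 0 for name in CATEGORIES}
    for story in stories:
        for name, present in score_story(story).items():
            totals[name] += present
    return totals


stories = [
    "He wants to win the race and be the best.",
    "The boy is tired, and the work is difficult.",
    "They walk home.",
]
totals = tally(stories)
print(totals)
```

Note that a story mentioning two cue words for the same category still counts once, which matches the yes-no character of the scoring.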

Well, so much for background. Now a little more in detail about the
procedure. We need three things: first, we need a method of collecting
thought samples. Here we modified the Murray TAT technique by
obtaining brief written stories from subjects under group testing
conditions. In this way we can test as many people at once as you can
get into a room in clear view of the screen on which we project the
pictures, in response to which the subjects write their stories. So it
is a group testing procedure. We put a short time limit on the story,
because we didn't want to give people extra credit for verbal fluency.
Since some people obviously can write long stories and others can only
write very short stories, we limit the amount of time to around five
minutes. In this time we obtain a kind of standardized thought sample
averaging around 90 words in length.

Secondly, we need a method of scoring for the achievement motive.
I will refer to the achievement motive in the Murray tradition as the
need for achievement or more briefly n Achievement. We need several
things as prerequisites for a scoring system. First of all, we need
a criterion of achievement imagery; we have to recognize the white
blood corpuscle when we see it, so to speak; we have to recognize the
achievement imagery when it is there. We developed, by a method
which I will describe a little later, a scoring criterion which can be
briefly summarized in this phrase, a kind of catch phrase that we use:
"competition with a standard of excellence." Examples of it can, of
course, be multiplied. A person wants to do a good job; he wants to
beat somebody else. These are the two types of standards of excellence
that you find, the same standards that golfers use in match and medal
play. In match play you try to beat the other guy; in medal play you
try to beat par. Both these types of standards are included under our
scoring criterion of "competition with a standard of excellence."

Next, we need a set of related categories. Having got the imagery
criterion, we must be able to identify other thought elements relating
to this central category. Here we tried very hard to get a related set
of categories that had some theoretical sense, that hung together. To
do this we simply followed the standard description of the
problem-solving behavior sequence that you find in any elementary
textbook, e.g. the process of adjustment. It is usually represented
with an arrow for the motive, with a rectangle for the obstacle which
the person goes around to get to the goal (represented again by a
"detour" arrow), etc. We defined subcategories for each part of this
behavior sequence. I will not go into more detail here, because I
assume that you are not interested in the detailed definition of these
subcategories.

Thirdly, we wanted a method of scoring that was, as I said earlier,
as operational as possible, as simple as possible, so that it could be
readily communicated and readily used by scorers. Here, of course,
the ultimate test is scorer reliability, the ease with which you get
high agreement coefficients between two trained scorers. We succeeded
pretty well. Our agreement coefficients ran around .90-.95 for scorers
judging on different occasions if they were well trained. Training,
incidentally, takes a week for some people, longer for others. There
seems to be an ability factor involved in ease of learning to score
such records. If somebody can tell me what it is, I would appreciate it.
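An agreement coefficient between two scorers can be illustrated with a short Python sketch. The talk does not say which coefficient was used; the product-moment (Pearson) correlation below is an assumption, and the two sets of scores are invented.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical total scores given to the same eight records by two scorers.
scorer_a = [2, 5, 7, 1, 9, 4, 6, 3]
scorer_b = [3, 5, 6, 1, 9, 4, 7, 2]

agreement = pearson_r(scorer_a, scorer_b)
print(round(agreement, 2))
```

With these invented scores the coefficient comes out at about .96, in the general region the speaker reports for well-trained scorers.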

Fourthly, we need a scoring system that is as economical and simple
to apply as possible. You may ask at this point why we didn't use a
multiple-choice system so that a machine could do the scoring instead
of a human being. It is obviously much more expensive to use a human
being and much more tedious, when you have got hundreds and
hundreds of records to score. The answer is, of course, that we would
like to use a multiple-choice test but it doesn't work, and if any of
you want to go out and try it, all I can say is, more power to you. We
have tried it and it has just never worked. The same seems to be true
of the multiple-choice Rorschach. Some day we will know why
multiple-choice projective tests don't work. Now there is just plenty
of practical evidence that they don't. It may be because
multiple-choice introduces a reality factor which tends to minimize
the importance of motivational determinants of perception (see a
recent experiment by Crutchfield and Postman in the American Journal
of Psychology on the effects of hunger on perception).

In any case our scoring system is not so terribly inefficient and
uneconomical. It turns out that a trained scorer, if you can keep him
at it (which is another problem), can score 50 to 60 records a day
without straining himself, and this means if you have ten scorers, you
can score five hundred a day. It is practical, in other words. It takes
about a minute to score a story, and five minutes to score an
individual record, which isn't excessive.

So far we had a method of collecting the thought samples, a method
of scoring them, and next we needed a method of arousing the
achievement motive experimentally. Here is where we took a new step
in the testing field, I believe. That is, we argued that we did not
want to have an a priori scoring system. We wanted one that reflected
sensitively experimentally-induced changes in achievement motivation.

So we began with two groups of subjects: roughly, a control group
and a group in which the achievement motive was aroused. Our
method was to compare the imagery in the stories written under
neutral conditions with the imagery in the stories written under
aroused conditions. We found shifts in achievement imagery. Students
wrote different kinds of stories under these two conditions, and we
used these differences to arrive at the definition of achievement
imagery which I gave you earlier. Note the importance of the
experimental variable in arriving at our scoring system. We used only
those imagery categories which increased in frequency when the
motive was aroused. In fact we redefined our categories so as to
capture as best as we could the differences in stories written under
"control" and "arousal" conditions.
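The selection rule described above, keeping only the categories whose frequency rises when the motive is aroused, can be sketched in Python. The category names and the scored stories are hypothetical, and the simple comparison of proportions stands in for whatever statistical criterion the investigators actually applied, which the talk does not specify.

```python
def category_rates(scored_stories):
    """Proportion of stories in which each category is present."""
    n = len(scored_stories)
    names = scored_stories[0].keys()
    return {name: sum(s[name] for s in scored_stories) / n for name in names}


def categories_sensitive_to_arousal(neutral, aroused, min_gain=0.0):
    """Categories whose rate is higher under arousal by more than min_gain."""
    rate_neutral = category_rates(neutral)
    rate_aroused = category_rates(aroused)
    return sorted(
        name
        for name in rate_neutral
        if rate_aroused[name] - rate_neutral[name] > min_gain
    )


# Hypothetical presence/absence scores, one dict per story.
neutral = [
    {"imagery": 0, "obstacle": 1, "weather": 1},
    {"imagery": 1, "obstacle": 0, "weather": 1},
    {"imagery": 0, "obstacle": 0, "weather": 0},
    {"imagery": 0, "obstacle": 1, "weather": 1},
]
aroused = [
    {"imagery": 1, "obstacle": 1, "weather": 1},
    {"imagery": 1, "obstacle": 1, "weather": 0},
    {"imagery": 1, "obstacle": 0, "weather": 1},
    {"imagery": 0, "obstacle": 1, "weather": 0},
]

selected = categories_sensitive_to_arousal(neutral, aroused)
print(selected)
```

Here the invented "weather" category drops out because it does not become more frequent under arousal, so it would not enter the scoring system.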

Now for the payoff, if any. You may well ask: all right, you have
demonstrated that imagery in stories changes when you arouse
achievement motivation; how can you use this to measure individual
differences in achievement motivation?

We took several steps here to see whether we were able to measure
individual differences. First, we did a very simple and elementary
thing. Having decided on our scoring system based on the categories
which increased when the motive was aroused, we simply summed these
characteristics in a given person's record. Suppose a person writes
eight stories: we went through and scored each story separately,
according to the achievement motive scoring system, and then counted
the different types of achievement imagery which appeared in his
eight stories and got his total score. People varied enough in this
total score to give us a reasonable spread and we could begin to relate
individual differences as measured in this way to other types of
behavior. The basic assumption is that if a person shows a lot of the
kind of achievement imagery which appears when the motive is aroused,
he must have a strong achievement motive.
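The total score described above is a plain sum of presence indicators over a person's stories. A minimal Python sketch follows, with hypothetical category names and an invented eight-story record.

```python
def total_score(record, scoring_categories):
    """Sum presence (1) / absence (0) marks over all stories and categories."""
    return sum(story[name] for story in record for name in scoring_categories)


# Hypothetical scoring categories retained after the arousal comparison.
scoring_categories = ["imagery", "obstacle"]

# Hypothetical record: eight stories from one person, already scored.
record = [
    {"imagery": 1, "obstacle": 0},
    {"imagery": 1, "obstacle": 1},
    {"imagery": 0, "obstacle": 0},
    {"imagery": 1, "obstacle": 0},
    {"imagery": 0, "obstacle": 0},
    {"imagery": 1, "obstacle": 1},
    {"imagery": 0, "obstacle": 0},
    {"imagery": 1, "obstacle": 0},
]

total = total_score(record, scoring_categories)
print(total)
```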

To what other types of behavior did we relate our achievement
score? First and foremost, as you might expect, we were interested in
knowing whether the achievement motive, if measured in this way, was
related to performance. That I suppose would be the first question
any of you would ask: do the students with high achievement
motivation work any harder? It seems logical that they should. Our
first experiments in this field were done with laboratory tests. If you
take a simple test like adding two-place numbers and give college
students a ten-minute repetitive test of this sort, you find that there
is a very significant difference between subjects with high achievement
motivation and those with low achievement motivation. That is, we
found that the ones with high motivation had a higher output of
arithmetic problems; they completed more of them in the time allowed.

Secondly, if you take a more complex task, like unscrambling words,
which is a relatively unfamiliar task as compared with adding
two-place numbers, you find that while the people with high and low
achievement motivation start out at about the same output level,
the ones with low motivation do not improve during a 20-minute test
period. Those with high motivation do improve, so that at the end of
the test period they are turning out more work per unit time than they
did at the beginning. In other words, they are sufficiently motivated
to learn new and better ways of unscrambling words.

I suspect that some of you will be interested in whether or not this
measure of motivation is related to grades. I am going to leave that 
until last, because it is a complex question, and treat it separately. 

Let me go on first to other types of behavior in the laboratory to
which this measure of motivation is related. Take memory, for
example. For years the problem of the better memory of incompleted
tasks, the so-called Zeigarnik effect, has been something of a puzzle,
at least to some psychologists. Why are incompleted tasks remembered
better? There have been, as you know, some conflicting results.
Sometimes you find this effect and sometimes you don't. We found
that one of the variables correlated with better memory for
incompleted tasks is achievement motivation. Subjects with high
achievement motivation have a better memory for incompleted tasks.
Subjects with low achievement motivation generally have a better
memory for completed tasks. They remember their successes, as it
were. They are a little bit defensive about this. The ones with high
motivation, on the other hand, apparently regard the incompleted task
as a challenge. They want to recall it so that they can complete it.
They think to themselves, so to speak, "If I had only had time to
finish that. If that guy hadn't interrupted me, I would have finished
it."

Or take level of aspiration, something that you would think
motivation should be related to. Here again we found a relationship,
if you rule out reality factors. That is, level of aspiration, as most
of us have assumed from the beginning, is partly determined by wish
factors and partly determined by reality factors. If you ask a person
what kind of a grade he expects to get in a course, he will be
determined partly by his past performance, by his previous grades in
this course, and also presumably partly by his need for achievement.

We found if you just correlate the achievement motive score with
level of aspiration, you don't get any correlation, but if you do it
when the reality factors are minimized, or are in conflict, when the
subject doesn't really have any basis for saying in reality what he
will do on a certain test, then you get a very significant correlation
with achievement motivation. This, of course, is exactly what you
would expect.

Take perception. We have done experiments on the recognition of
words with the tachistoscope, and we find, as one would expect, a
certain selective sensitivity. The ones with high achievement
motivation recognize words relating to achievement more rapidly.

Let me just mention two others. I could mention a great many more,
and perhaps if there is time for a question period, you can ask me
about them then.

A very popular test nowadays is the F scale, a measure of
Authoritarianism. Roger Brown at Michigan tested to see whether
achievement motivation was related to the F scale. I must say I did
not expect any relationship, but to my surprise, he found one, but it
was inverse. That is, students with lower achievement motivation are
generally higher on the authoritarianism scale. I think you will see
why this may be so in just a minute.

Some of you are familiar with the Asch judgment experiments.
Typically he presents three comparison lines and a standard line to
six stooges and one non-stooge. The six stooges all say in succession
that one of the comparison lines is the same length as the standard,
when it is obvious that it is really longer. So this places the
non-stooge in a conflict situation. He has just heard six other
students say that these two things are objectively equal and it is
perfectly plain that they are not equal. So what does he do? Well,
under these pressure conditions, about a third of the subjects, e.g.
college students, fold: they yield to social pressure and call out the
wrong line.

Asch has wondered why some students yield and some do not. We
found, quite surprisingly, that the non-yielders, the people who
refused to yield under this pressure, are the ones with high
achievement motivation. There is almost no overlap in n Achievement
scores of the yielders and non-yielders.

A reason for this can be found in our research on the origins of
achievement motivation. What kind of home background, what kind
of childhood training is characteristic of the people with high and low
achievement motivation? A very nice thesis has just been completed by
Marian Winterbottom at the University of Michigan on this problem.
It begins to explain how some of these things hang together. She was
interested in the number of demands and restrictions that parents
placed on their children, and at what age. She chose sons aged 8 to
10 and she interviewed their mothers and gave them questionnaire
schedules to fill out.

What she found, to make a long story very short, is that the mothers
of children with high achievement motivation made many more
demands for independent decisions earlier than those with low
achievement motivation. For example, consider an item she actually
used, "Do you expect your child to learn his way around town by
himself?" This is one aspect of independence training. All the mothers
said they did require this of their sons. But the mothers of children
with high achievement motivation said that they expected the child to
know how to do this before the age of 8, which happened to be the
median age at which the distribution of expected ages could be split.
These mothers required more independence earlier; in other words,
there was great pressure from these mothers for independent activity
of various sorts: crossing the street by oneself, making friends, doing
well in school, etc. All of these independence-training needs seemed
to be required earlier by the mothers of sons with high achievement
motivation. So I think you can begin to understand why the products of
this kind of parental background would stand out against the pressure
of the group in the Asch experiment, why they would be more at the
democratic end of the Authoritarianism scale. And vice versa, you
can see why the ones with low achievement motivation, coming from
a more protected background, would tend to be more dependent on
other people of authority; why they would be willing to follow the
crowd, even when it is wrong, and so forth. I need not elaborate.

Now, to turn to my last point, namely, the problem of predicting
judgments of performance. This is a long way of saying "predicting
grades," and I chose the long way on purpose. Predicting judgments
of performance is no mean trick, as most of you know. I do not regard
it as especially difficult in this case to predict performance, but to
predict judgments of performance is quite a different matter; it is the
criterion problem with which you are all familiar.

Actually, we have done a number of studies of the relationship
between n Achievement score and grades in high schools, colleges of all
sorts, and our correlations are sometimes high and sometimes low. I
remember when we first ran this correlation, for a college sample; it
came .51. We were so elated that we nearly sat down and sent a
telegram to Professor Terman saying "Forget about your intelligence
test; we can predict grades better with a 20-minute projective test."
Well, it is a good thing we didn't, because we ran the correlation on
another sample and the next time the correlation was zero. A healthy
corrective for enthusiasm, the repeated experiment!

To summarize this research the way it stands now (the Educational
Testing Service will straighten us out on some of these things, I
hope), there is a median correlation of n Achievement with grades in
the [illegible] with intelligence partialled out: significant, but
nothing to get terribly excited about.

Let me mention what I think two of the main problems are in getting
such predictions of grades. In the first place, how much does the
criterion, namely, grades, depend on motivation in a particular case?
We have found a case, and I am sure you know of such cases, where
the correlation of Otis I.Q. with high school grades is 0.90. Can you
expect any correlation of grades with motivation if this is so? You
may find one, but you certainly aren't going to add anything to the
prediction of grades that you get from the Otis I.Q. alone. Maybe the
teacher just looked up the intelligence test scores and graded
accordingly.

How much does the grade criterion depend on motivation? And how
much on the teacher's idiosyncrasies? Langlie and others showed
twenty-five years ago that grades in high school, at any rate, and I
suppose in college, are correlated with teacher judgments of other
personality characteristics, e.g., attractiveness, physical maturity,
and other characteristics of that sort. So there is certainly impurity
in the criterion, e.g., judgment of performance as compared with
performance itself.

The other problem that has been very puzzling to me is whether
or not it is really legitimate to partial out intelligence. The normal
way of proceeding is to correlate n Achievement with grades,
intelligence with grades, and then partial out the correlation of
intelligence with n Achievement. There is always a positive correlation
between achievement motivation and intelligence, and there ought to be,
it seems to me. Take the extreme case. Whatever the native ability of a
person, if he has no motivation to learn, he is not going to get a high
intelligence test score. So it seems to me there ought to be some
correlation between achievement motivation and intelligence test score.
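
The "partialling out" step described above can be sketched with the standard first-order partial correlation formula. This is only an illustration: the function name and the three input correlations are hypothetical values chosen for the example, not figures reported by the speaker.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation of x and y with z held constant.

    Here x could stand for n Achievement, y for grades, z for intelligence.
    """
    denom = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denom == 0.0:
        raise ValueError("undefined when |r_xz| or |r_yz| equals 1")
    return (r_xy - r_xz * r_yz) / denom


# Hypothetical correlations, for illustration only.
r_motive_grades = 0.40  # n Achievement with grades
r_motive_iq = 0.30      # n Achievement with intelligence
r_iq_grades = 0.50      # intelligence with grades

r_partial = partial_correlation(r_motive_grades, r_motive_iq, r_iq_grades)
print(round(r_partial, 3))
```

With a positive correlation between motivation and intelligence, the partial coefficient comes out below the raw one, which is the loss the speaker is worried about when motivation has itself helped to produce the intelligence score.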

There are two places where motivation enters into an intelligence
test score: one in the accumulation of knowledge which he shows on
the intelligence test or achievement test, and the other in the
attention he gives at the time he takes the test. We know that people
who have high achievement motivation will actually do better in the
testing situation. So there is an intertwining here of achievement
motivation and the intelligence measure. Is it fair then to partial out
in relating motivation to grades if we know motivation also determined
the I.Q. to some extent? If we do, we are eliminating part of the
effect that motivation has on performance. If we don't, we can be
accused of simply finding a correlate of I.Q. which therefore ought to
predict grades to some extent. It is a difficult problem to think
through, the relation of motivation to performance and intelligence,
but these are our contributions to it to date. At least we think we
have a method of measuring motivation which should provide plenty of
food for thought.

Participants

Philip Ash, Edwin G. Flemming, Charles Langmuir, David C.
McClelland, Joseph Zubin

Dr. Zubin: I believe we are all deeply indebted to Dr. McClelland
for a very timely discussion of a problem that is facing research in
personality. I have but three comments.

First, about the technique for eliciting achievement motivation
which was used. We were not told exactly how it was done, but
apparently it had something to do with the introduction of incentives
for achievement in the one group and no such incentive for achievement
in the other group. Of course that is a very school-like situation;
and one wonders whether that kind of experimental eliciting of
motivation bears a high degree of relationship to the wide variety of
facets that motivation consists of. It may very well be an important
aspect of motivation, but that it encompasses the entire variable that
we regard as achievement motivation is doubtful. On the positive side,
when one begins to tackle such a field, it is good to separate out the
different facets, but whether the kind of achievement motivation
elicited in school is very important for achievement motivation in life
remains a very important question.

I also wonder whether the very simple test he used as an indication
of motivation, namely the increase in rate of simple additions under
incentive conditions, which had previously been used by the Character
Education Inquiry and by the Spearman School for the measurement of
motivation, might not give as good a correlation with degree of
achievement motivation present during the experiment as the dissection
of the person's imaginative production on the TAT would yield.

This very simple task of simple addition may give you as much as
the more complicated analysis.

As to my second point, the technique utilized by Dr. McClelland
essentially consists of utilizing derivatives of the projective technic
method. This is a very worthy derivative. We have been able to
demonstrate not long ago in our own laboratory that when content
analysis is applied to TAT-like pictures presented tachistoscopically
in the specific focused situation involving interpersonal
relationships, and the contents of the response for that particular
variable are scaled, we find tremendous differences between the
performance of individuals who are normal, those who are neurotic,
those who are chronically ill mentally, and those who are only in the
early stages of illness. The idea of using derivatives of the TAT or of
other projective technics in a motive-focused manner and then scaling
the results along the dimension under investigation is a very worthy
one, and it has been applied not only to the TAT, but also to the
Rorschach. Dr. McClelland's findings add much weight to this approach.

For example, when the content of the Rorschach is analyzed on
specific scales for measuring dimensions of content involving such
variables as cheerfulness, anxiety, sociability, etc., significant
correlation is obtained, whereas as you know, ordinary clinical scoring
of the Rorschach gives very low correlations with such personality
variables. The whole method is part of a very healthy approach of
trying to make sense out of the chaotic field of projective technics by
singling out particular segments and focusing attention on the
particular performance related to that segment.

The third point, and I think Dr. McClelland will agree with me, is
that he has defined motivation in a very narrow setting. He has limited
himself to what you might call, for lack of better terms, rivalry and
competition, competition with norms, rivalry with others. But there is
more to motivation than just that. Certainly professional motivation,
if you limit it to these points of view, gives you only a very small
part of the picture. What about cooperation as a motive? What about
curiosity as a motive? What about altruism as a motive?

All of these motives are lost in the particular sector that Dr.
McClelland has selected and I do not mean to say that therefore his work
is not of value; it is of tremendous value. I believe, however, that we
should not be surprised at the low correlation between school
achievement and his achievement-motivation score because he has not
measured motivation in all its aspects; he has taken two aspects of it
which perhaps unfortunately the American scene stresses unduly. The
other aspects of motivation may not be as strongly developed in the
average person in our culture, but that they do form at least part of
the achievement motivation of many people cannot be doubted. Perhaps
finding tests for measuring these latent motives may hasten their
development.

Dr. McClelland: Let me make one comment. I am frequently
accused of not doing things that I didn't intend to do in discussions
of this particular method. Of course we didn't measure the curiosity
motive; we didn't intend to. Also there are lots of other motives that
certainly can be measured, using the same method. I perhaps should
have made that clear. I think the method that we used here of
concentrating on one motive (and I would even agree that it is one
aspect of one motive) is one that can be generally applied, and I think
that is the main significance of what I said here today.

We intended to deal only with one aspect of motivation, and I think
that the final test of whether that aspect was worth concentrating on
or not is contained in the twenty or thirty relationships that we have
between it and other important variables. That is the ultimate test of
usefulness of any analytic approach.

I remember when we first started doing this, five years ago, six years
ago now, Dr. Rapaport said substantially the same thing that Dr.
Zubin said. He said the motive you arouse in the laboratory has nothing
to do with achievement motivation in life; you are wasting your
time. I am glad I didn't listen to him.

Dr. Flemming: I want to ask Dr. McClelland whether he has
correlated this achievement test with practical achievement in the work
situation, such, for instance, as the achievement of salesmen.

Chairman Bennett: I take it you all heard the question. Dr.
McClelland says the answer is no.

Mr. Langmuir: Could Dr. McClelland give us any information about
the variation in his measures of achievement motivation under different
conditions of arousing it, or over an interval of time in successive
tests of the same individual?

Dr. McClelland: This is a complicated question in a way. The
stability of the achievement motivation measure as we now use it is
not ordinarily high. People who are used to scoring intelligence tests
are going to be alarmed at this. If you retest a person weekly, or six
months later, you do not get as high test-retest or reliability
coefficients as you ought to get, at least as we are used to thinking
you ought to get. That is, they run probably as low as the sixties and
seventies rather than up in the eighties and nineties.
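
A test-retest coefficient of the kind mentioned here is ordinarily the product-moment correlation between scores from two administrations of the same measure. The sketch below assumes that definition; the function name and the two score lists are hypothetical and are not data from the studies discussed.

```python
import math
from typing import Sequence


def pearson_r(first: Sequence[float], second: Sequence[float]) -> float:
    """Product-moment correlation between two paired sets of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two paired lists with at least two scores")
    n = len(first)
    mean_a = sum(first) / n
    mean_b = sum(second) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(first, second))
    var_a = sum((a - mean_a) ** 2 for a in first)
    var_b = sum((b - mean_b) ** 2 for b in second)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("correlation undefined when one list is constant")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical scores for six people tested twice, for illustration only.
first_testing = [4, 7, 2, 9, 5, 6]
second_testing = [5, 6, 3, 7, 7, 4]

retest_r = pearson_r(first_testing, second_testing)
print(round(retest_r, 2))
```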

There is another way of looking at the problem. It involves the
whole question of the relationship between validity and reliability.
The other way of looking at it is that if the measure was stable, or
more stable, it probably wouldn't be as sensitive. In other words,
intrinsically, motivation is something which does vary probably more
from day to day and week to week than intelligence does, since we
more or less assume that intelligence is something which remains
relatively stable over time; at least we try to measure it in a way
which yields stability. Motivation, on the other hand, is something
which I would say intrinsically varies more.

Dr. Ash: I have a question that relates to one asked previously.
I gather that in its present form the test has not been used in the
industrial situation, but I wonder, first, whether as it is used now, or
as it can be used now, it might be appropriate in industrial testing.
My second question would be, has it been used to observe relationships
between need achievement and such variables as leadership behavior
and group acceptability, for example, in line with the work
done by Shartle at Ohio State?
Dr. McClelland: My answer is no, although there were some
studies done that are a little bit relevant to this problem down at the
University of Maryland. Field did a study on the effect of social
rejection on the n Achievement score and he found very serious sex
differences. I haven't mentioned the sex differences which appear with
this test. They are very markedly dependent on the type of arousal, and
this supports what Dr. Zubin said earlier. I agree with most of what he
said. It is only that I was trying to do something different. I am
afraid I sounded as if I did not agree with him. I do, because we know
that different arousal conditions will produce different effects, and
the big sex difference is a major case in point. For example, the
achievement motivation score of women does not increase under our
normal arousal conditions. We thought the women were very refractory;
we tried and tried it again, and the men's score increased every time,
by our scoring system, but the women's score didn't.

Field, at Maryland, did a study in which he rejected both men and
women. That is, they were told that there was going to be a sort of
popularity poll and that they were going to learn the results of it. He
handed them back slips of paper on which it was clear that they were
in the group or out of the group, rejected or accepted. Now, under
these conditions, the women's achievement motivation score went way
up if they were rejected. After we discovered this, we found that Else
Frenkel-Brunswik had shown this years ago in her motivation study
in which she found that high achievement motivation, as rated by
teachers, was pretty closely correlated in girls with appearance,
dressing, and things of this sort, with the social side of the
achievement motivation, whereas in men it is more connected with
leadership and intelligence (the factors referred to in our normal
arousal conditions). Apparently it doesn't threaten a woman nearly as
much to call her unintelligent as it does to call a man unintelligent.

I have greatly oversimplified the nature of the achievement motive.
I want to say that I am afraid my earlier remarks were not as serious
as they should have been. There are all kinds of achievement motives.
We know, for example, that there are some people who are characterized
primarily by a "hope of success," others by a "fear of failure." I
did not have time to discuss all these variations.

There are some whose achievement motivation is focussed on athletics,
or playing bridge, or being a Don Juan. So I certainly have to
agree with Dr. Zubin that the motive is much more complicated than
what our simple, over-all index shows. The research problems remaining
are very great indeed.

WHAT ABOUT THE SAMPLING?

A BIT OF PSEUDOHISTORY

I suppose all of you are well aware that the sampling methods that
our polling organizations are using this year have evolved from quite
primitive beginnings. First of all, of course, there was a plant life
period or stage in which a newspaper reporter asked the people he found
around him how they were going to vote. He sent out his roots like a
plant and got what he could from the spot where he happened to be. Then
came the dinosaur stage, the huge mail canvasses of ten million or more
post-card ballots that were sent out by the "Literary Digest" and
similar surveys made by other publications, ending up in the swamps.
They were bogged down by the mere size, and if anything they proved
that size alone is not sufficient for survival.

Next, if we can jump a million years, more or less, came the dawn
of civilization, and with it came men who had at least the rudimentary
beginnings of an alphabet. They roamed over their hunting grounds,
capturing big, lumbering elephants and stocky, stubborn donkeys in
order to put on a great race. They laid heavy bets on the spectacle
and gave odds, and everyone tried to dope out the outcome beforehand
so that he could bet on a sure thing.

They took their primitive alphabet and marked the fattest animals
A's, the good and [illegible] ones B's, the middle weights C's, and the
scrawny ones D's, so they would have a fair mixture of economic levels
and have a good race. In like manner they gathered a proper mixture
of males and females, old and young, cave dwellers and denizens of
the forests. They did this every four years and they had great fun.
But in the end it availed them naught, for after they flourished for a
while, they grew too bold, and in 1948 B.C., they plunged with the
biggest bet of all time. And when they lost their shirts, and all their
following had likewise, there arose a great wailing and a loud cry,
"Oh, what have we done, or failed to do, oh, Lady Luck, that you
should desert us now?"
And they approached the soothsayer, and the soothsayer said, "You
were not careful enough in most of what you did, and you should have
done much that you neglected to do, but most of all, you didn't pay
attention to Mother History, or learn your lessons about the probable
error, and the last minute shift, and the mysterious ways of the
undecided and evasive, and the wiles of the editor who wants you to do
your stunt way out on the limb. And the heartbreak of the
photo-finish." And they went away sad and repentant.

Well, now, the year has come around for the next race. The pollsters
have managed to gather the animals again and all the followers of
Mother History are asking, "Will they go off chasing Lady Luck
again? Have they learned their lesson? What about those pitfalls?
What about the sampling?"

Well, now, fellow cave dwellers, no one can tell whether the pollsters
will tag after Lady Luck until tomorrow, or perhaps Monday morning
when the final call comes for placing the big bets. They say they won't
be betting this time, maybe never. Here (holding up a letter) one of
them says, "Us? We are not predicting."

They hope that other people won't use their reports as a racing form
and lose money on foolish bets, but still they say, "We will tell you
all we know, everything we know about the animals, and then it is up
to you."

They have gone farther into the dawn of civilization and they have
learned a great deal about numbers and counting, at least on one hand
(counting fingers), 1, 2, 3, 4, 5, and they have learned a new religion,
called "Probability Sampling." Some of them have embraced the new
religion while keeping a few of their old pagan beliefs. Some others
are trying to use the same old prayers and magic again for they hear
that the new religion is a strict master and that it exacts a heavy
price before it will help them in any way. Whatever their present faith
and doubts, none of them seems confident of his dope on the race, or
hopeful about the benign intervention of the supernatural. They are
saying, and will probably continue to say, "It's anybody's race."

That is as far as I will go in predicting what Sunday morning's or

Well, what about sampling, then? Some of the state and local polling
organizations, and quite a number of newspapers, appear to be using
very much the same methods, at least so far as sampling is concerned.



There are, of course, the same old half-serious stunts: the chicken
poll is on again, but we don't know whether it is the farmers or
the hens that are making the big decisions about who is going to be
elected. The cigarette poll is on. It samples smokers, ignores
non-smokers; undoubtedly a bias right there. The taxi driver poll is
on, the barber poll, and no doubt the astrologers

The important problems of sampling do not center in these side-shows
in the election circus. Neither are they to be found in those
really serious canvasses that are operating in certain instances as if
nothing happened in 1948, and in other instances as if the important
thing is not to try to pick the winner, but to really find out in a
genuinely scientific way something more than we presently know about
political behavior and the way in which people make up their minds.

They are to be found in the more noteworthy surveys of opinion
and election behavior that are now being made, some of which have
not come to our attention because their results will be reported more
deliberately after the election is over. However, some of the
better-known polling organizations are also contributing more than they
have previously to the more serious, long-range study of election
behavior. Therefore, even though there may be few forecasts, the
sampling problems associated with their work are still very important,
for we must appraise the results of the polls in so far as they offer
any possibility of increasing our understanding of how people think on
issues and how they decide to cast their ballots on election day.

The analysis of the details of the sampling operations as they are
actually carried out by the pollsters is a major undertaking that none
of us has attempted so far. I would like to stress that it is not
something that you can do in a day or that you can do without going
through a great deal of material that is available only in their
offices. Some of the material we would need isn't even available there.
To judge how the sampling operations are really working out now
compared with previous operations was difficult enough in 1948; it is
more difficult today, I think. Nevertheless, I will attempt to make a
few general observations on the methods that are being used by the more
prominent of the polling organizations. They will gloss over the
details and give just the broad outlines.

The principal changes that have been made since 1948 in the selection
of the sample of people to be interviewed have been made in the
direction of assigning to interviewers a selection of city blocks and
specific directions on how to do the interviewing. The
instructions specify the starting point on each block and tell how the
interviewer should count off a designated number of households from
each selected household to pick the next household in which to seek
an interview. There are instructions about how an individual is to be
selected within each household that is thus chosen for the sample.
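
The count-off rule described above (a fixed starting point on the block, then every so-many-th household after each one selected) can be sketched as a simple systematic selection. This is a minimal illustration, not the pollsters' actual field instructions: the function name, the numbering of households along the block, and the example figures are all assumptions. How the individual is chosen within a household is not spelled out in the talk, so it is left out.

```python
from typing import List


def count_off_households(households_on_block: int, start: int, interval: int) -> List[int]:
    """Positions (0-based, in walking order) of the households to call on.

    The interviewer begins at position `start` and then counts off
    `interval` households from each selected one to reach the next.
    """
    if households_on_block <= 0 or interval <= 0:
        raise ValueError("block size and interval must be positive")
    if not 0 <= start < households_on_block:
        raise ValueError("starting point must lie on the block")
    return list(range(start, households_on_block, interval))


# Hypothetical block of 20 households, starting at the third, every fifth thereafter.
selected = count_off_households(20, start=2, interval=5)
print(selected)
```

The point of such a rule is that the interviewer no longer decides whom to approach; the starting point and the interval decide it, which is what removes the free choice allowed under quota sampling.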

There are also some general arrangements designed to get more of
the interviewing into the period when people are home from work in
order to reduce the losses that occur in daytime interviewing. This
procedure replaces the old quota sampling procedure in which the
interviewer had a relatively free choice of respondents so long as he
satisfied certain quotas assigned by economic level and sex and
followed certain general instructions about obtaining a representative
group of respondents to interview. The relatively new procedure of
"block sampling" is actually forcing interviewers to go into areas in
cities that they had avoided before, or missed altogether. It will
probably remove much of the bias in economic level and education
that was characteristic of previous polls. How much, we can't say,
but it seems to be a direct consequence that it should have that effect.

However, there is a general disposition among polling organizations
not to require interviewers to make additional calls when they
find no one at home at the first attempt. In some instances there is a
provision for substituting a neighbor, but in other instances no
attempt is made to replace or to regain interviewing attempts that are
unsuccessful in the first instance. In addition, the older quota
sampling methods are still employed in some of the rural areas or in
other situations in which the block sampling procedure is difficult to
apply.

There are some other types of sampling that are being used, such
as Gallup's pin-point method, but they represent supplements to the
main samples rather than the principal sampling procedure itself.
If there were time enough and you were interested in such details, it
would be quite appropriate, I think, to examine the ingenious attempts
that are being made to find a way of sampling that is not as costly and
troublesome as people think that probability sampling is, and yet that
avoids some of the weaknesses of the older methods.

Back in 1948 a special committee of the Social Science Research
Council made a comprehensive review of the polls shortly after the
election. It concluded in its consideration of the sampling problem
that the available evidence was not adequate to measure the extent to
which quota sampling contributed to the systematic errors in the 1948
election. On the other hand, the examination of the sample
educational distributions suggested that there was a considerable
systematic error in most of the quota samples, although the amount
of it, the magnitude of it, could not be measured.

The report also held that there was a possibility of improving the
sampling methods, but emphasized the fact that numerous factors
other than the sampling error contributed to the gross error of
predicting the division of the vote among the candidates. It warned
specifically that the use of probability samples will not in any way
guarantee that one can predict elections. Hence, we may expect that
these current changes in the direction of probability sampling will
improve the accuracy of the polls somewhat, but will not enable them
to succeed where they failed in 1948.

In the absence of a definite analysis of their accuracy, we must
assume that the current percentages are still subject to a degree of
error, from sampling alone, of the order represented by a standard
deviation of perhaps two or three percentage points. This is little more
than a guess based upon the past performance of the polls and what
we know about the general outlines of the sampling methods now.

Including other sources of error the gross or total error may be
of the order represented by a standard deviation of 5 to 6 percentage
points. Now, this doesn't mean that they can't be, as we used to say,
"right on the nose," but as you all know from the applications of the
theory of error in testing and related fields, what is important is not
the possibility of being exactly right and having a zero error; it is what
the long-run experience of a variety of errors leads us to expect. The
guidance we can get by assuming some approximate value of the
standard deviation is an important element in reaching a sound
judgment about the meaning of the polls. These guesses may well be
exceeded in the case of samples or sample results that are based on
fewer than, say, a thousand respondents or polls that are subject to
more than the average degree of error.
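
One way to read the two guesses together is to ask how large the non-sampling error would have to be to raise a sampling standard deviation of two or three points to a gross figure of five or six. The sketch below assumes, purely for illustration, that sampling and non-sampling errors are independent so that their variances add; the speaker does not state this, and the function name is hypothetical.

```python
import math


def implied_other_error(total_sd: float, sampling_sd: float) -> float:
    """Non-sampling standard deviation implied by a total and a sampling SD.

    Assumes independent error sources, so total^2 = sampling^2 + other^2.
    """
    if total_sd < sampling_sd:
        raise ValueError("total error cannot be smaller than sampling error")
    return math.sqrt(total_sd ** 2 - sampling_sd ** 2)


# The speaker's rough figures, in percentage points.
for sampling_sd, total_sd in [(2.0, 5.0), (3.0, 6.0)]:
    other_sd = implied_other_error(total_sd, sampling_sd)
    print(f"sampling {sampling_sd}, total {total_sd}: other sources about {other_sd:.1f}")
```

Under that assumption the non-sampling sources account for most of the gross error, which agrees with the report's warning that better samples alone will not guarantee a correct forecast.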

Therefore, when you examine any of the results of the polls, please
increase each percentage by 5 percentage points or more, and also
subtract from each percentage the same number of points, then look
at the two figures you get and draw your conclusions from the
assumption that if you continue operating in this way the percentage
you wish you knew will be caught between these two figures about twice
as often as it falls outside.
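
The "twice as often inside as outside" rule corresponds to a band of about one standard deviation on either side of the reported figure, if the gross error is taken to be roughly normal with zero mean. The sketch below makes that assumption explicit; the function names, the 52 per cent example, and the choice of 5 points (the lower of the gross-error figures quoted earlier) are illustrative only.

```python
import math
from typing import Tuple


def band(reported_pct: float, half_width: float) -> Tuple[float, float]:
    """Lower and upper figures obtained by subtracting and adding the allowance."""
    return (reported_pct - half_width, reported_pct + half_width)


def coverage(half_width: float, sd: float) -> float:
    """Chance that a zero-mean normal error with the given SD stays within the band."""
    return math.erf(half_width / (sd * math.sqrt(2.0)))


low, high = band(52.0, 5.0)            # hypothetical reported percentage
inside = coverage(half_width=5.0, sd=5.0)
odds = inside / (1.0 - inside)         # inside-to-outside odds
print(low, high, round(inside, 3), round(odds, 2))
```

Widening the allowance to two standard deviations raises the coverage to about 95 per cent, which is the adjustment of the odds the speaker leaves to the audience.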

You know how to adjust this if you want to change the odds of

you are seeking. And then don't
forget that about one-third of the time even this crude way of making
allowance for inaccuracy will lead us to underestimate the actual
gross error in the particular percentage we have before us.

What about sampling, then? It has been improved, but not as much
as it could be. It exhibits all the practical problems we encounter when
we attempt a large house-to-house survey in any field, but this isn't
the main reason why the pollsters cannot tell you clearly who will be
elected, or very accurately how different groups of the population
differ in their reactions to issues and candidates. That is a story for
the next two cave men to tell you.

INTERVIEWING

After the polls failed in 1948, my colleagues here, and I, took part in
an investigation of the polling organizations. This time it appears that
we are taking no chances. We are investigating them even before they
have had any time to fail. I might say that the three major polling
organizations have been very courteous to us and allowed us, in a
sense, to do this detective job again and we are most appreciative.

During that earlier investigation, the story went around of the
interviewer who wrote to her agency and remarked that she knew why the
polls had failed. She described her experience in interviewing on the
election and said that she had this strange experience: she kept
running into respondents who continually reported that they were
planning to vote for Truman. This happened so often that she knew
something was wrong. She figured something was wrong with her
interviewing or her sampling, and so she threw out some of those cases
and did some more interviewing until she found enough Dewey
supporters.

I suppose that this story (which, no doubt, was invented by some
wit rather than being the real truth) is a kind of dramatic illustration
of the contribution that the interviewer conceivably might make to the
success or failure of the polls to predict an election. Obviously the
sampling and the research design and the analysis can be perfect, but
in so far as the raw data that are collected are inadequate, of course,
the error is implicit in all the later predictions.

This possibility of interviewer error is perhaps the reason why I
was assigned the topic of changes in interviewing methodology since
the '48 polls. In actuality, there is very little to be said in the way
of anything new on the problem, for the polls have made no real, radical
changes in their interviewing procedures or their interviewing staffs,
and the detailed story of this aspect of survey research can still be
found in the Social Science Research Council Bulletin No. 60 on the
1948 election investigation. [illegible] there is a detailed chapter
[illegible] magnitude of error created through the interviewing
[illegible] Stephan pointed out with respect to [illegible] determine
exactly how much the interviewer [illegible]. There was some putative
evidence [illegible], let us say, a rather small part of it.

Now, this absence of [illegible] might be regarded by you
[illegible] of the polling agencies. While I do not want to [illegible]
on their part, I might describe to you certain features [illegible]
procedure in one of these major agencies [illegible] for the fact that
there has been a persistence of [illegible]. Incidentally, this
information might be of [illegible] to you apart from its relevance in
explaining the [illegible].

Interviewing in the [...] must not be conceived in the image of
interviewing in [...] study or in the image of interviewing for
clinical [...].

Interviewing, in [...], is a very massive field operation, [...]
paid employees. Their rates of pay run between a dollar and a dollar
and a half an hour. The number of interviewers engaged in an election
survey would [...] but would run into perhaps two hundred or [...].
[...] features [...] the research agency itself [...] any change in the
composition or geographical distribution [...] of such a staff, no
matter how well advised such changes may be for a given election
prediction survey.

First among these is the fact that all the election polling agencies
conduct these [...] side line. Their major work, except [...] for an
occasional [...] years, consists of market research and a variety of
[...]. The character of the field-staff operation is [...] essentially
[...] by these more continuing needs rather than by the specific needs
associated with effective election predictions. [...] the sample
designed for predicting an election might call in certain areas for
certain strengths, as, for example, in an estimate of critical states,
which strengths would otherwise not be needed for the rest of the
year [...].

Similarly, the same situation might call for a certain political
composition in the staff, because of the possible ideological bias of the
interviewer, but this same political composition may be irrelevant for
most other market research purposes.

Radical alterations in the composition of this staff through
replacement or firing or necessary additions involve a rather considerable
expenditure. While the cost of recruiting and training and supervision
of a single interviewer is difficult to determine, we might, for our
purposes here, set the figure at $75 per unit interviewer, which would
be a ridiculously conservative estimate. Now, this figure may appear
negligible to you, but, when you multiply this two hundred times,
you find that the usual agency has an equity of perhaps $15,000 in its
current field operation, a property not easily jeopardized [...]
the fact that these agencies are commercially run. But even where
change may be called for and organizational factors such as I have
mentioned are ignored, there is a great difficulty in making fundamental
changes in the composition of survey interviewers. For example,
it is interesting to note that on the present continuing permanent field
staff of the Roper agency, there is only one lonely male interviewer
out of [...]. All the rest are women.

This situation, I can assure you, does not represent the agency's
libido at work. This is a product of larger institutional factors that
affect the type of individual that is available in the labor market from
which interviewers can be recruited for survey research.

In the course of a detailed investigation of interviewing and survey
research that the National Opinion Research Center has been engaged
in under SSRC and Rockefeller Foundation auspices, my colleague,
Paul Sheatsley, conducted a very intensive study of this labor market
for interviewers [...] such facts as the following: while there
is some variability within this market, and the staffs of different
agencies show different profiles, there is a modal type of interviewer
available for hire. No matter what agency or which time period is
studied, the college educated comprise about three-fourths of all the
staff, women at least two-thirds or more.

The Negro interviewers, whom NORC was able to hire over the
past twelve years, are an educational elite. They are far better educated
even than our white interviewers, about one-third of them having
done post graduate work beyond college. This is quite interesting when
you consider the fact that they are interviewing by and large a segment
of the population with far less education.

One other finding by Sheatsley is of interest. The rigidity of this
labor market was examined by comparing the characteristics of those
interviewers hired before World War II, during the war, and in
periods since the war. Despite the massive population shifts due to
wartime [...] stable over all this time. The same stability is
demonstrated, no matter which supervisor tries to recruit interviewers
in any area. For these organizational and institutional reasons, the
composition of the interviewing staffs used shows no major change. Nor
has there been any major change in the training of these interviewers
or in the actual conduct of the interview.

Such procedures of training are fairly standard, fairly rigidly
institutionalized in the agency, are regarded by them as working
moderately well to insure quality, and they reason that the problems of
prediction relate much more to the realm of sample design or to the
realm of conceptualization and analysis of voting preferences. These
are the problems my colleagues are addressing themselves to.

On the score of training and supervision, I should report, however,
improved methods of quality control. For example, Crossley has
developed a procedure which he [...] in 1948 of checking the
performance of each interviewer by comparing results on a series of
demographic characteristics [...] that characteristic, for the same
sample point. Any major discrepancies between the results for that
interviewer and the population data imply either error or, what is
worse, cheating, that is, that the interviewer fills out the answers
himself in the privacy of his home. Under such conditions, Crossley
institutes some disciplinary action against this interviewer.
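
A check of the kind just described can be sketched in a few lines. This is only an illustration and not a description of Crossley's actual procedure: the function name, the age categories, the counts, and the cutoff of three standard errors are all hypothetical, and the standard error assumes simple random sampling, which a real sample point would not strictly satisfy.

```python
from math import sqrt


def flag_discrepancies(interviewer_counts, population_shares, z_cutoff=3.0):
    """Return categories whose share in one interviewer's returns departs
    sharply from the known population share for the same sample point."""
    n = sum(interviewer_counts.values())
    flagged = []
    for category, count in interviewer_counts.items():
        p = population_shares[category]
        observed = count / n
        # Standard error of a proportion under simple random sampling
        # (an assumption made only for this sketch).
        se = sqrt(p * (1 - p) / n)
        z = (observed - p) / se
        if abs(z) > z_cutoff:
            flagged.append((category, round(observed, 3), p, round(z, 2)))
    return flagged


# Hypothetical returns from one interviewer and hypothetical population
# shares for the same sample point.
returns = {"under 35": 30, "35 to 54": 15, "55 and over": 5}
population = {"under 35": 0.35, "35 to 54": 0.40, "55 and over": 0.25}

for row in flag_discrepancies(returns, population):
    print(row)
```

A flagged category shows only that the returns and the population data disagree. Whether the cause is error or cheating still has to be settled by the supervisor.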

Roper similarly has expanded a system of control involving
regional supervisors who report on the quality of performance of
interviewers under them through direct observation and through filling
out detailed rating sheets. A description of that procedure is given
in detail in the summer 1952 issue of the Public Opinion Quarterly.
Consequently, there is reason to believe that while the interviewers
have not changed in character, they are under somewhat better control.

There is also reason to feel that they are a pretty highly experienced
staff for the type of field problem they are encountering. Roper, for
example, had one interviewer who recently died after a length of
service of sixteen years. Crossley has at least fifteen interviewers as of
the present who have had lengths of service running between [...] and
27 years. These people are obviously of considerable experience.

With respect to the actual interviewing procedure as it is used, it
should again be noted that interviewing in the survey must not be
regarded in the same way as other types of interviewing procedure.
The procedure that is used in election predictions is essentially
predetermined by the standardized questionnaire developed and by the
explicit accompanying instructions rather than by the discretion of the
interviewer.

The interviewer is, in a sense, much more a machine rather than a
professional person given freedom to exercise his judgment, that is,
apart from the choice of respondents in a sampling design.

Changes in interviewing procedure for election purposes are really
much more the province of research design. On this score we might
note a few changes in design that in turn affect the interviewing
assignment. For example, in 1948 there was reason to believe that last
minute shifts were of considerable significance and would have to be
treated in future research, and so there is greater emphasis this time
on telegraphic surveys which involve the interviewer operating under
conditions of stringent deadlines, fast interviewing, and a return of
the results by telegram. In a sense, this improvement in design creates
certain additional possibilities for errors due to [...].

Similarly, Crossley in '48 initiated some research on filtering out
ineligible voters and uninterested voters. This procedure of filters
seemed a very good procedure, and he reports that he has developed
it further. Again this places upon the interviewer additional difficulties,
because these filters must be treated differently in different parts of
the country. Eligibility requirements for voting vary in the most
capricious way from place to place, and the interviewer in exercising
these filters in the interview must evaluate them differently from area
to area.

The other major improvement in question design which affects the
interviewers, notably in Roper's work, is the use of batteries of issue
questions which attempt to define the constellation of attitudes
surrounding the preference which either make that preference sturdy or
make it precarious because of conflict between desire to vote a
candidate in and desire to see certain ends achieved with respect to
issues. This naturally creates more difficulty for the interviewer,
particularly because while Roper can easily, in this way, see the
constellation of attitudes surrounding the preference, he must also be
in a position to evaluate the hierarchical importance of different
issues within this constellation, which involves basically open-ended
interviewing.

These are some forms of the question changes that in turn must be
implemented in the interview situation. Of course, apart from the
specific procedures of interviewing, there is the general problem of
rapport and inter-personal relations with the respondent. On this
problem, you might think that the memory of the 1948 fiasco would be
carried by the population and impede effective interviewing. However,
trend data collected by the NORC since 1947 indicate that the decrement
in public confidence is negligible. In [...] there was an all-time low
in such public confidence, but data collected this month indicate that
the polls have [...] the public; only about two percent of the nation
actually being really hostile to the poll, and about six out of every
ten reporting a favorable view.

More than this, some of the general psychological difficulties
associated with the interview situation in '48 seem to be less operative.
It was quite common in '48 to obtain reports from interviewers of a
hidden Wallace vote which was not declared out of fear of
stigmatization. There was even an occasional report then of annoying
confusion between the names [...].

The general problem of evasion, as in the case of the Wallace vote,
and consequent response error does not seem so present now. Only
sporadic difficulties are being reported. The interviewers are remarking
on the high level of respondent interest, the willingness to talk,
and even the fact that women, normally very apathetic in political
polls, are alert and interested.

This would seem to be a brief account of the changes, or rather lack
of change, in the interviewing aspect of the election polls since '48.
I am not implying that the polls have made no changes elsewhere, or
that they should make none; they have serious problems elsewhere,
and perhaps some minor ones in the interviewing field. But the problems
that are most crucial seem to lie elsewhere in the research process,
and the agencies show sound judgment in allocating more of their
energies in those directions.

and Their Probable Effect on 1952
Election Predictions

SAMUEL STOUFFER



ANALYSIS 

DR. SAMUEL STOUFFER: I think that everybody in this audience has
a serious professional stake in what the polls do in this election. Let's
not forget what some of our friends and critics who have no use for
psychometrics or for quantitative methods in general had to say in
1948, and how some of them tended to draw the conclusion that human
nature and human behavior is intrinsically unpredictable. Hence, it
could be inferred, we cannot even predict on an actuarial basis whether
people will do well in college on [...].

This type of attitude was one which was lusciously enjoyed by some
of our colleagues in the humanities, and such distinguished scientific
journals as The New Yorker, the week after the 1948 election, came
out with choice statements expressing gratitude to the pollsters for
clouding up the crystal ball and expressing respect for the American
public for telling one thing to the poll-taker and doing the opposite in
the voting booth.

I want to make a few brief points on the very large subject of how
the polls are handling the analysis of the data they are collecting. First
of all, I want to say that I think the integrity of the major pollsters is
beyond question. I think they deserve a great deal of credit for courage
for behaving cautiously, because it would probably be better from the
standpoint of public reaction if they said they were sure that
Eisenhower was going to win or that Stevenson was going to win than to
hedge. But they intend to speak definitely only if they are convinced
that their data point that way; and their data are not likely to point
conclusively enough one way or another to make it possible for them
to make a definitive statement.

Why is this likely to be the case, apart from the problems of sampling
and apart from the problems of interviewing?

The first point I want to make has to do with the peculiarity of our
electoral system. In 1948, even if the polls had been right on the
button, [...]. If Dewey had got about half a per cent more of the popular
vote, he would have won the election. No poll is going to be that close.
If Truman had received one per cent more than he got, Truman would
have won by an electoral landslide of four to one, which would have
been one of the most unprecedented electoral landslides in American
history. That was because the vote was so close in all of the key states.
Therefore, it is quite clear that even if a poll is extremely accurate,
our electoral situation may make it impossible to predict an electoral
vote with any precision at all. Today the pollsters are telling the
people that in every way they can possibly do it. But they didn't tell the
people that enough in 1948. Some of them said it, but they just didn't
say it strongly enough. Now they are saying it, and of course others
are laughing and saying, "Well, the pollsters are just not going to take
any chances." Actually, of course, it was relatively easier in the
Roosevelt period. There is a good reason why it was easier, and that was
shown by the fact that the polls in the Roosevelt elections showed a
relatively small number of undecided voters. People pretty well knew
what they thought about him. He was either the great hero who
had saved the country and later the world, or he was "that man," and
the number of people who probably made up their minds during the
[...] was small. There was not much evidence of fluctuations after the
first of September.
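
The point about the electoral system can be made concrete with a small sketch. The five states, their margins, and their electoral votes below are invented for illustration and are not the 1948 returns; the function name is likewise hypothetical. The sketch shows only how a shift of a fraction of a point in the margin, well inside any poll's error, can move a large block of electoral votes when the key states are close.

```python
def electoral_votes(state_margins, electoral, swing=0.0):
    """Electoral votes won by candidate A after a uniform shift in A's
    margin over B (both margin and shift in percentage points)."""
    return sum(
        ev for state, ev in electoral.items()
        if state_margins[state] + swing > 0
    )


# Hypothetical states: A's margin over B in percentage points, and the
# electoral votes each state carries.
margins = {"S1": -0.4, "S2": -0.3, "S3": 0.5, "S4": 6.0, "S5": -8.0}
votes = {"S1": 25, "S2": 47, "S3": 28, "S4": 20, "S5": 30}

for shift in (-1.0, 0.0, 0.5):
    won = electoral_votes(margins, votes, swing=shift)
    print(f"shift {shift:+.1f} points: A wins {won} of {sum(votes.values())}")
```

With these made-up figures, candidate A goes from 48 of 150 electoral votes to 120 of 150 on a shift of half a point in the margin.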

The 1948 election represented something very different indeed. There
was a large number of undecided voters, twice as large, the polls
themselves show, away back in September, 1948, as compared with earlier
elections, but experts did not take it too seriously. Previous studies of
the undecided voters showed that they tended to go about like the
rest of voters did; hence the tendency was to neglect the undecided
voter.

The other thing that was neglected was the possibility of some
last-minute changes in the attitudes. The 1952 election has followed a
course up to, I would say, a week or ten days ago, according to the
polls, that is very similar to the 1948 election. You have a big start,
apparently, for Eisenhower, gradually being dwindled away; and then
you come to today, a few days before the election. Now, at this point,
what is going to happen? This has the pollsters tearing their hair, and
they are making very careful polls this week, trying to see whether or
not there is any evidence of a very sharp pro-Democratic trend which
happened in the last week, or whether the difference in the campaign
procedures this year, particularly the Republicans' efforts to maintain
their momentum and to introduce the Korean issue with all the power
they know how, will prevent any trend towards Stevenson from
continuing.

One cannot make empirical generalizations with confidence that
what happened in one election necessarily will be repeated in another.
But if the trend for Stevenson continues, pollsters realize that they
could be facing on election morning a figure which shows the two
candidates about fifty-fifty in the popular vote. If, however, the trend
does not continue, the election will probably show Eisenhower with a
margin of popular vote, which still doesn't mean he necessarily would
win in the electoral vote.

FIGURE I

This year the pollsters have sought explicitly to take into account two
different kinds of uncertainty which enter into responses. These are
illustrated as variables in Figure I. On the horizontal axis we have
varying degrees of probability of voting. On the vertical axis we have
varying degrees of enthusiasm for the candidates.

Now the polls can fill in each cell with frequencies. If you cut the
horizontal axis somewhere near the middle, on the assumption that
only the more probable half of the voters will vote, you can consolidate
all the frequencies to the right of the cutting point and come up
with a figure as to how the probable voters are leaning. But in the
middle of the vertical dimension are a block of voters who haven't
made up their minds. We can ignore them and base our estimate on
the two upper and two lower tiers, or we can make some assumptions
about them. In 1948 people in the middle blocks voted for Truman,
but it is not safe to assume they will vote Democratic this year, even
though their characteristics are Democratic.
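
The consolidation just described can be sketched as follows. Everything specific here is an assumption made for illustration: the five tiers, the four columns of voting probability, the cell frequencies, the function name, and the share of undecided voters assigned to Stevenson. None of it comes from the polls themselves.

```python
# Hypothetical cell frequencies. Rows are tiers of enthusiasm; columns run
# from least likely to vote (left) to most likely to vote (right).
grid = {
    "strong Eisenhower": [10, 20, 60, 110],
    "lean Eisenhower":   [20, 30, 50, 40],
    "undecided":         [40, 40, 40, 30],
    "lean Stevenson":    [25, 30, 45, 40],
    "strong Stevenson":  [15, 20, 55, 100],
}


def probable_voter_split(grid, cut, undecided_to_stevenson=None):
    """Consolidate the cells to the right of the cutting point and return
    (Eisenhower share, Stevenson share) among probable voters.

    If undecided_to_stevenson is None, the middle tier is ignored;
    otherwise that fraction of it is assigned to Stevenson and the rest
    to Eisenhower.
    """
    right = {tier: sum(cells[cut:]) for tier, cells in grid.items()}
    eisenhower = right["strong Eisenhower"] + right["lean Eisenhower"]
    stevenson = right["lean Stevenson"] + right["strong Stevenson"]
    if undecided_to_stevenson is not None:
        eisenhower += right["undecided"] * (1 - undecided_to_stevenson)
        stevenson += right["undecided"] * undecided_to_stevenson
    total = eisenhower + stevenson
    return eisenhower / total, stevenson / total


print(probable_voter_split(grid, cut=2))
print(probable_voter_split(grid, cut=2, undecided_to_stevenson=0.7))
```

With these invented frequencies, ignoring the middle tier gives Eisenhower the lead, while assigning seven in ten of the undecided to Stevenson reverses it. The estimate depends on the assumption made about that block.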

And there are additional complications. Many of those leaning
toward Eisenhower are normal Democratic voters, some of whom say
they are for Eisenhower but prefer the Democratic Party. Will they
vote for Ike? Such people in 1948 who said they were going to vote for
Dewey tended to swing back into the Democratic camp in the last
week or two. This time they may mean what they say. But we cannot
be positively sure, and that is why the present caution of the pollsters
is eminently justified.

Finally, I want to say a word or two about one of the most important
things that the pollsters are doing, and that has to do with the analysis
of the cross pressures which are present in this election. I am sure
that you and I agree that prediction, particularly predicting a national
election, is a pretty dangerous thing and it can have a boomerang
effect. On the other hand, the polling data with all its intrinsic errors
represent the very best information which we have about the trends in
public opinion and about the ways in which issues impinge on various
classes of our population.

I think we can be very confident from what the polls have told us
that the majority of voters in this election (we can have five or ten per
cent error and still be all right on this) really think they would
personally be better off if the Democrats won the election. I think we can
also trust what the pollsters tell us when they say the majority of
voters think the Republicans can handle such problems as Communism and
corruption better than the Democrats. You have people in basic
conflict who think that personally they would be better off if the
Democrats won, and they do not like the Communism and corruption
business. Irish Catholics are a very good example. And so the polls
provide data which make it possible to take Irish Catholics who say
they are going to vote for Stevenson, Irish Catholics who say they are
going to vote for Eisenhower, and examine by correlational procedures
their responses to a variety of questions, including some open-ended
questions on what they think are the most important issues. The variety
of questions asked will give us a better picture than any other
procedure known as to how those issues impinge on various segments of such
a population. It won't prove anything in terms of causation, but I
think we can make inferences from it that are safer than any other
kinds of inferences would be.

I do not want, however, to sell the prediction element short. In spite
of all the difficulties involved in this matter, I think one can say
something with a good deal of confidence about the directions in which the
vote is leaning in certain states, and that may be useful too. I like to
look upon this kind of prediction as a little like the job of the Weather
Bureau in its longer range forecasting, where it is forecasting the
weather for the week-end at the beginning of the week. Now, the
Weather Bureau is going to make mistakes; it will make serious
mistakes. It is making its predictions in terms of probability. It has a job
trying to educate the public to realize that these answers aren't
definite, but that they represent a probability statement which is
genuinely better than the guesses made by somebody who sniffs with his
nose and feels out the weather. The only trouble is that the public has
been miseducated to some extent and expects the Weather Bureau
to say that there is going to be exactly 2.4 inches of rain or 1.2 inches
of [...]. That kind of prediction the pollsters can't make. But I have
got a good deal of confidence that polling procedures are going to
become more and more acceptable and that there will be public support
for the improvement of the procedure, for I have confidence in the
integrity of the pollsters.

Techniques for the Development of
Unbiased Tests

IRVING LORGE



DIFFERENCE OR BIAS IN TESTS OF INTELLIGENCE

From time to time, scientists need to reappraise the concepts of their
science, their methods of measurement, and the application of their
knowledges for the general good. Psychologists, during the
nature-nurture controversy, have had to reevaluate not only the concept of
intelligence but also that of environment. For more than fifty years,
they have been revising the meaning of intelligence, the various tests
and procedures for its estimation, and, more especially, the implications
of the evidence from tests for the understanding of children and their
achievements. And, of course, they have critically reviewed the
applicability of general, and special, intelligence tests for the selection,
classification and guidance of individuals.

Psychologists, as well as educators, in the fulness of time may feel
obligated to the authors for "Intelligence and Cultural Differences."
For again, they have asked them to reconsider the meaning of test
intelligence. As contemplated, the book has motivated anew serious
reexamination of intelligence and of intelligence-tests. Perhaps the
authors, too, intended that some psychologists should become
emotionally disturbed by the use of "differences" in the title in contrast
with the use of "bias" within the text. Such feelings of disturbance
may arise when such psychologists think of bias as some procedure by
which some person "with malice aforethought" consciously prejudices
a method of measurement to support an unfavorable (or favorable)
opinion about persons, things or ideas. Few objective psychologists
report "differences" for the purpose of proving a bias or a disparity.
Most studies of individual or of trait differences, beginning with Galton
and including Eells, have provided the evidence that measurable
differences between groups exist. In test-intelligence, in particular,
whether general or specific, differences have been found between
groups classified by sex, and by age, and by education, and by
geographic origin, and by occupation of father, and by cultural
background, and by socio-economic status. Indeed, differences have been
found in test-intelligence between groups classified by body-type, and
by physical health, and by personality structure, and by nutritional
status, and by family unity. Such reported differences from tests of
intelligence have made test-makers as well as test-users increasingly
aware of the multiplicity and intricacies of factors related to test
performances of individuals and of groups. Not only are differences
affiliated with groups, but they are affected by environment.
Inadequate stimulation, within deprivational environments, may affect
performance negatively. Indeed, we now recognize the interactions
of heredity as endowment and environment as opportunity for each
maturing individual. Children who, during their early years, are
deprived of linguistic, and of social, stimulation, as a group, do poorly
in test-intelligence, and indeed, often are inadequate to cope with
the range of adjustments the environment demands. The fact of
"differences" is well-established: test performance reflects the specifics
of environmental opportunities of training, of experience, and of
stored achievement.

Test-users have been instructed, over and over again, that an
individual's test score must be interpreted always in light of an
understanding of the variety of factors and conditions that are related to
measures of intellect. Psychologists have provided normative data for
a variety of groups because they know differences in test performance
are related to sex, age, grade-placement, and socio-economic status.
Furthermore, they have cautioned that a child's motivations and
physical well-being do influence test performances.

Inevitably some users of tests neglected to profit from the tutelage.
They wilfully treated test scores as absolute determinations about
individuals, or, even, groups. Others, of course, failed to appreciate
fully the range and interaction of circumstances that affect test
performance. To overcome such perversity and such ignorance, some
psychometricians tried to be quit of the bias of the test-user by
attempting to eliminate the differences from the tests.

Usually, the attempt to make an unbiased test of intelligence is an
attempt to reduce some kind of group difference to zero. For instance,
it is well-known that boys and girls (and men and women) perform
differently on tests of verbal, and of numerical, content and process.
For fear that the biased opinion that women are superior to men
should predominate, psychometricians, for upwards of a half century,
have reduced "differences" by addition. All of us are fully aware that
to overcome the obtained verbal superiority of women, the test-maker
adds a sufficiency of numerical reasoning items to make the average
total score of men equal that of women. No difference, ergo, no bias.
Fortunately, there still are differences between the sexes.
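
The arithmetic of reducing a difference "by addition" can be sketched briefly. The figures are hypothetical and the function is an illustration only, not any test-maker's published procedure: it assumes a constant average advantage per verbal item for one group and a constant average advantage per numerical item for the other, and it returns the smallest whole number of numerical items that closes (or just overshoots) the gap in average total score.

```python
from fractions import Fraction
from math import ceil


def numerical_items_needed(n_verbal, verbal_gap, numerical_gap):
    """Smallest number of numerical items that offsets the verbal gap.

    verbal_gap:    one group's average advantage per verbal item
    numerical_gap: the other group's average advantage per numerical item
    Both gaps are in score points per item and are assumed constant.
    """
    # Exact fractions avoid floating-point rounding before taking ceil.
    verbal_gap = Fraction(str(verbal_gap))
    numerical_gap = Fraction(str(numerical_gap))
    if numerical_gap <= 0:
        raise ValueError("numerical items must favor the other group")
    return ceil(n_verbal * verbal_gap / numerical_gap)


# Hypothetical test: 40 verbal items with a 0.05-point gap per item,
# offset by numerical items carrying a 0.10-point gap the other way.
print(numerical_items_needed(40, 0.05, 0.10))
```

The equality of average totals that results is built in by the choice of items. It says nothing about whether the underlying verbal and numerical differences have disappeared.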

Partial justification, indeed, does exist for such a procedure. In
general, a test-score that is based on a composite of many kinds of
intellectual processes and contents does give valid (and reliable)
estimates about most persons' potentialities for success with the kinds
of ideas and skills taught in schools. The emphasis should be on "most";
many, however, may be misappraised because a score from many
different tasks will fail to reveal the facts about differences within the
individual's mental organization, and hence, by extension, fail to give
information about differences in the mental organization of different
groups. Of course, to apply Galton's suggestion of [...]
"shafts" does require more time than most test-users are willing to
expend. For practical purposes, then, psychometricians have accepted
either Binet's theory about the unitary character of intelligence, or
Spearman's demonstration of the pervasiveness of "g." The consequent
acceptance of the single index of mental-age, or intelligence quotient,
or an intelligence score led to expectations that these scores were the
absolutes about a person. They are not. The results of factor analysis
have proved the need for the measurement of different aspects of
intelligent functioning. Basically, differential aptitude tests attempt to
measure "differences" as differences.

The measurement of "differences," however, is both costly in test
construction and expensive in testing time, so that the single-index
score will exist for some time to come. It must be recognized that
most, if not all, so-called unbiased tests of intelligence are [...]
appraisals.

Another method for attempting to produce an unbiased test [...]
is to try to reduce group difference by subtraction. Essentially, [...]
of adding items to conceal a difference, this method removes [...]
that produce the difference. The research reported in "Intelligence and
Cultural Differences" deals with a technique for discovering [...]
of items in some current tests of intelligence that differentiate [...]
some socio-economic groups. Eells, as a matter of fact, [...]
significant relationship between measures of social and [...]
and measures of intelligence. He [...], as careful workers [...]
before him, that, on the average, the test-intelligence of [...]
lower socio-economic status scores was lower than that of those whose
status was higher. The fact of socio-economic differences in
test-intelligence is reconfirmed. The implications of those facts, too, [...]
to the development of social inventions to reduce the environmental
differentials which may affect the test performance of the
socio-economically less privileged. Indeed, the full history of [...]
educational legislation, and practice from the "Old Deluder [...]
the contemporary requirement of compulsory schooling illustrates
the dynamics of democratic social engineering. Eells and his
co-workers, however, took a different view of the facts.

Tliey, apparenflys assumed tiiat the individuals in the va 
economic fbratiflcations were equal in intelligien^. Hencej 
erences were found, it must be the test or some .kind^vpfi'^* 
.that produces the di^erences* Thus was created the lo^ _ 
differencey ergo^ bms. In avoiding the one horn, psy^blo| 
inevitably be embarrassed on the otlier. In facing the al|p^^^ 
1 teyer, educators and psychologists must be aware of 
tkie procedure, Eells, having established that dif 
test-intelligence between groups that they assigned;^ 
stratat*proceeded to select t#6 samples at either ex 
score range, namely, children of old Americait stocfi wno w| 
qlassifled as of very high or very lo^ status by the credteon^ 
posite Index of Status Chftraeteri|bJ6^- He, then, nia^r^ 
of ^ Jarge pdrtioh of' the tasks flie sever|l^tf Iligen^ 
the individuals in each extrent6 had takenr Since the median j 
of cortedt^e^p^Ses for thr High Status group was about8i^|in^, 
th0 jjpw Status group' was aboi^ 70, it must foUw thajtj^^^Wity 
#0/ the iem^i^^ll favor the Jiigh Status groups, was 
established ^^ihe ana^ interesting finidifti^^mi^.^&s jhat 

the differences in item performance between the two status
groups has "a direct relation to the form of symbolism in which the
item is expressed." The High Status group is favored most on verbal
items, but the gradient of difference becomes less and less for items
based on meaningless number combinations, and approaches zero
for items involving pictures, geometric design, and stylized drawing.

Apparently, the discovery of such a symbolism difference suggests
that a culture-fair test could be made of those tasks that minimize
verbal processes and that favor those that require the manipulation
of numbers, geometric designs and pictures. Such a test of such tasks,
of course, can be made. But, if it were, what would it measure? It
seems expensively trustful to put reliance only on such items that fail
to distinguish between demonstrably different status groups. Some
criterion about intellectual functioning, other than the one that the
items make for no diversity, seems, at least, a psychological
prerequisite. Certainly, within each extreme, variation in test performance
must have been symptomatic of intelligent behavior that, to a very
large degree, was a consequent of differences in ability or aptitude.

If such an unbiased test were produced by subtraction, it neither
would be a test of intelligence nor would it give any evidence about
the impact of status or culture on test performance. Certainly, Eells
and his co-authors had methods available for item selection that would
have maintained some relation to a criterion for intelligent functioning
while minimizing the impact of status or culture. At least, partial
correlation would have led to the making of a culture-fair test without
losing the appraisal of intelligent behavior. At best, Eells' method
could produce a test, but it would be a matter of conjecture as to what
such a test measures. Clearly, the evidence from the many so-called
non-verbal and non-language tests suggests that what they measure is
different from what is measured by the so-called verbal tests.
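
The remark about partial correlation can be illustrated with a small
sketch. It is not a reconstruction of any procedure used by Eells or
proposed in detail by the speaker; the names item, criterion, and status
and the four-child toy data are assumptions made only for illustration.
The sketch computes the correlation of an item with a criterion of
intelligent functioning while a status index is held constant.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_corr(item, criterion, status):
    """First-order partial correlation of item with criterion,
    with the (hypothetical) status index held constant."""
    r_ic = pearson(item, criterion)
    r_is = pearson(item, status)
    r_cs = pearson(criterion, status)
    return (r_ic - r_is * r_cs) / sqrt((1 - r_is ** 2) * (1 - r_cs ** 2))


# Hypothetical toy data for four children (assumed, not from the source).
status = [-1, 1, -1, 1]       # assumed status index
criterion = [-1, -1, 1, 1]    # assumed criterion, unrelated to status here
item = [s + c for s, c in zip(status, criterion)]  # item reflects both

raw = pearson(item, criterion)
adjusted = partial_corr(item, criterion, status)
print(f"raw r = {raw:.3f}, partial r = {adjusted:.3f}")
```

In this toy case the item's relation to the criterion looks only moderate
until status is partialled out. An item chosen on the partial correlation
would keep its tie to the criterion, which is the property the paragraph
above asks of a selection method.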

Of course, the administration of the same verbal test to groups
maturing under different language experiences would favor the group
for whom the test language was their own vernacular. Test scores from
a verbal intelligence test designed for Chinese would certainly put
some Americans at a disadvantage. Indeed, not only will groups
perform differently if they are separated widely by their languages but
also if they have developed different cultural attitudes and values. Many
psychometricians have endeavored to produce tests which are
culture-free. From the days of Army Beta, attempts to remove the differences
attributable to culture have been ingenious although not fully
successful. To the long line of such tests, including Dodd's International
Group Mental Test, Cattell's Culture-Free Intelligence Test,
Spearman's Visual Perception Test, and the Multi-Mental Non-Language
Test, should be added Rulon's Semantic Test of Intelligence. Each one
of these ventures to achieve an unbiased test by substitution. Since
the fact of different cultural and linguistic background prohibits the
use of the language of any one group or a language common to all
groups, the test-maker attempts to appraise intellectual performance
by the manipulation of objects, or of pictures, or of designs or of
numbers. The tasks, set by the psychologist, require intelligent
behaviors of perception, selection, generalization, and organization. In
cross-cultural comparisons, however, differential experience with
pictorial representation, for example, may significantly influence the
way the tasks are perceived, the specifics are selected, and the way
such aspects are restructured. Some of you, indeed, may remember
the non-language item in the Army Non-Language Test. The task
was to cross out the picture that did not belong with the other four.
Chinese inductees, invariably, viciously and erroneously crossed out
the illustration of a rising sun because of its symbolism for them.

Rulon's new semantic test should prove ultimately to be a fruitful
lead. In essence, it sets the task of "learning" to associate a geometric
symbol for a concept generalized from a number of drawings of
worldly events. The process involves the acquisition of a symbolic
glossary which is tested by requiring the subject to show his mastery
of the glossary not only as individual signs but also in combined,
semantic and syntactic organization. Involved in the task of learning
the glossary and in demonstrating mastery over it, is the additional
one for the subject to infer what he is to do. Basically, the kind of
learning is somewhat like associating a Chinese ideograph with a
concept generalized from several pictures. In contrast with the more
extensive spoken or visual vocabulary, the Rulon glossary approach
involves very few signs, meanings, and syntactical patterns. Under such
limitation, the process differs in complexity from the more usual tests
of verbal intelligence. Rulon, indeed, finds that the correlation between
Stanford-Binet mental ages and the score on the Semantic Test of
Intelligence is very low for a constrained sample of feebleminded
children. One reason, but not the only one, may be that the processes
tapped by the Semantic Test are quite different from those appraised
by the Stanford-Binet. The added evidence contrasting the relation of
school achievement with the Semantic Test and with the Stanford-Binet
supports the belief that the two tests are not measures of the
same functions.

Test-makers apparently have tried to eliminate bias from the
appraisal of intelligence by covering up group differences, by eliminating
tasks that make for group differences, or by substituting different
processes in evaluating groups. Do such procedures really remove
the bias from the measurement of intelligence? My answer is No.
They do reduce, of a certainty, the amounts and kinds of information
about test performance of separable groups. Scientifically, however,
ignorance of difference is a costly way to produce unbiased tests of
intelligence.

The objective psychologist cannot fail to see the reductio ad absurdum
of making unbiased tests of intelligence. For instance, following
the implications of Eells' procedure and findings, a test involving
manipulation of numbers, geometric designs and stylized drawings
will probably favor men and boys. Will it then be necessary to select
from such items the few on which women and girls will be equal to
men? And if this be accomplished, should only those items on which
endomorphs make performances equivalent to ectomorphs be retained?

There can be little doubt that among some kinds of groups differences
do exist. As a matter of fact, the wide range of general and
specific tests of intelligence has made it possible to establish much of
the available knowledge of differential psychology. Not only has the
awareness of such differences led to the emergence of a more adequate
understanding of the relative advantages and limitations of intelligence
tests but it also has increased our appreciation of the significance of
difference in the understanding of children as individuals, and in
groups. In a democracy, such as ours, respect for difference as
difference is necessary. There is no virtue in developing instruments so
blunted that they decrease the amount of information. Perhaps the
best method for reducing bias in tests of intelligence is to use them
with the full knowledge that endowment interacting with opportunity
produces a wide range of differences. Appraisal of the variation of
different kinds of intellectual functioning requires many kinds of tests
so that the differences can be utilized for the benefit of the individual
and for the good of society. Intellectual functioning certainly does
involve the ability to learn to adjust to the environment or to adapt
the environment to individual needs and capacities by the process
of solving problems either directly or incidentally. Such a concept
recognizes a variety of different aptitudes for success with different
kinds of problems. The full appreciation of the variety of aptitudes
and the development of adequate methods for appraising them should,
in the long run, ultimately lead to the production of enough
information to eliminate bias.

As the psychologist develops tests to measure mastery of different
contents and processes, he will obtain the evidence about the
inequalities of opportunity for maximum development. With such
information, the psychologist, in cooperation with educators and others
interested in social amelioration, will try to make those social
inventions which will allow all in our democracy to have an equal
opportunity for maximum development of their potentialities. The full
utilization of such social inventions and social engineering will not
eliminate the established fact that there will be differences among
individuals and between groups. When differences are reduced by
the advantages of opportunity, the credit will be to the tests that
showed their existence. Difference as difference is not bias, but the
information about it will lead to the gradual disappearance of some
kinds of bias.

Non-verbal tests of intelligence have never been satisfactory. They
have not correlated well with verbal tests of intelligence, nor with
success in intellectual or academic endeavors.

In the case of many non-verbal tests, the intellectual operation called
for does not seem to be the same as that called for in academic or
intellectual pursuits.

The distinction between the usual verbal test and the usual
non-verbal test is not so clear when easy items are considered, as when
more subtle or difficult items are examined. The strictly non-verbal
tests of the past have by and large retreated from the function they
were trying to get at whenever they were made difficult enough to be
useful in selecting a few of the more able members of the population
tested.

The problem undertaken by the present investigators was to
develop a testing technique which would be free from the more or less
glaring shortcomings of the usual non-verbal test, and at the same
time be free from some of the commoner defects of verbal tests.

The following defects in the typical non-verbal test were regarded
as worthy of avoidance:

1. In the administration of some non-verbal tests, verbal instructions
   are employed to tell the examinee what is required of him.

2. The examinee is presented with novel material which deprives
   him of the opportunity to exhibit any use he may have made of
   opportunities to make ordinary observations of the surroundings
   in which he has lived.

3. A time limit is sometimes imposed which renders the non-verbal
   test a speed test rather than a power test.

4. Some non-verbal tests require the manipulation of concrete
   objects, such as blocks, marbles, or other simple familiar things.
   It is hard to contrive items making use of these materials such
   that the items are difficult in the sense ordinarily understood by
   intellectual difficulty.

5. Some non-verbal tests require the reading of symbols (such as
   Arabic digits) which may be non-language, strictly speaking, but
   which are nevertheless associated with the use of language in
   our culture.

6. Some non-verbal tests require a verbal response from the
   examinee.

7. Some non-verbal tests, such as form comparison tests, put a
   premium upon visual perception almost to the extent of rewarding
   visual acuity; the difference which the subject is required
   to detect between two geometrical figures may be so minute as
   to present essentially a problem of visual acuity.

The deficiencies in certain verbal tests which were regarded as
particularly to be avoided include the following:

1. Some verbal tests give an advantage to persons from certain
   cultural backgrounds, regardless of the language employed. That
   is, the content is more familiar to persons from one culture than
   to those from another.

2. Some verbal tests require an exhibition of previously acquired
   knowledge, rather than testing a skill necessary for accomplishing
   a new

3. Some verbal tests are essentially speed tests.

4. Some verbal tests allow a free response which causes scoring
   difficulties.

5. Some verbal tests put a premium upon the examinee's facility
   with his native language.

It was the purpose of our work to derive a non-verbal test technique
which would be acceptable on general grounds, and be free from as
many as possible of these undesirable characteristics.

The work was conducted in the Harvard Graduate School of Education
under a contract between the President and Fellows of Harvard
College and the United States Government, represented by the
Personnel Research Section of the Personnel Research and Procedures
Branch of the Personnel Bureau of the Adjutant General's Office,
Department of the Army.

The manner in which we have attacked this problem can be seen
best, I think, by your now opening your test booklet to the left-hand
inside page. This page pretty closely parallels the first page of our
42-page test booklet as it is now arranged. In giving the test, we make
motions indicating that the symbol at the top goes with the five
pictures. Motions then indicate to the examinee that the first symbol
in the exercise is identical to that in the definition above. This is done
without saying anything. Searching motions among the five options
in the first exercise terminate in locating the COW WALKING in the
third option. Motions of comparison between this picture and the
fourth picture above and also motions of comparison between the
symbol at the left and the symbol above terminate in the examiner's
drawing a circle around the third octagon in the first exercise. Why
don't you now draw a circle around the third octagon in the first
exercise.

Similar motions of comparison terminate in the examiner's circling
the fourth option in the second exercise. Suppose you now circle the
JUMPING COW at that place. The motions are again repeated, still
without saying anything, and the first octagon is circled in the third
exercise. I suggest that you circle that STANDING COW and then
go on with the rest of the page as our examinees are encouraged to do.

On the adjacent right-hand page you will find a lay-out very much
like page 13 of our 42-page test booklet. You will see that the symbols
at the left alternate between COW and JUMPING. For the first
exercise we make motions indicating the similarity between the
symbol in the exercise and the right-hand symbol above, and then
make searching motions among the five options which terminate in
the second option. Motions of comparison between this picture and
the Woman jumping in the glossary above terminate in the
examiner's drawing a large circle around the second option in the
first exercise. In the second exercise motions of comparison are made
between the symbol and the COW symbol above, and searching
motions among the options terminate in the fifth option. Motions of
comparison between this option and the WALKING COW above terminate
in the examiner's drawing a large circle around the last option in item
2. Similarly in the next item the fourth option is circled by the
examiner, after which the examinee is encouraged to circle the
appropriate options on the rest of the page. Suppose all of you go ahead
and do that at this time.

So far we have been engaged in a relatively simple intellectual
operation which may be identified as a digit-symbol substitution
exercise, except you may have noticed that in the second exercise
on this page the WALKING COW was a mirror image of the one
in the definition. Furthermore, in the third exercise, the JUMPING
CAT was not the same CAT as in the glossary at the top.

In the next exercise, that is the fourth one, you circled a jumping
animal which was not shown at all in the glossary for JUMPING. In
marking that exercise you must have abstracted the concept of
JUMPING from the actions shown in the glossary at the top. You couldn't
have marked that answer by a simple digit-symbol substitution.

Turn to the back page of your booklet. I have shown you what happens
on another page of our 42-page booklet. In dealing with the first item, the
examiner must here make two sets of motions of comparison. By such
motions he shows that the first symbol agrees with the COW symbol
above, and the second symbol agrees with the JUMPING symbol
above. The searching motions among the options terminate with
option 4. Then motions of comparison are used between this option
and the JUMPING COW at the top left, and of
comparison between this picture and the JUMPING COW at the upper
right. These motions terminate in the examiner's circling the
COW in the fourth option. In the next exercise similar motions of
comparison terminate in the examiner's circling the first option. If you
will circle this option and in the next item circle the second option
after comparing the symbols, I will then turn you loose on your own.

I have a few more remarks to make after all of you complete the
exercises on this page.

As you may well suppose, the next step is to introduce three-symbol
sentences, such as MAN BEATS HORSE or HORSE DRAGS BOY
or BOY BEATS MAN. The highest level to which we are now going is
to the four-symbol sentence, such as WOMAN KICKS DOG LYING
DOWN or MAN BEATS WOMAN RUNNING, and th

I am sure you must have got the idea by this time why we called it
the Semantic Test of Intelligence. What we have done is to imitate
in a non-verbal test the semantic relationships presented in the typical
low-level verbal intelligence test; that is, to require the subject to
associate an arbitrary symbol with a worldly referent, to indicate his
mastery of this association, and then to combine these symbols into
groups in which the relationships between the symbols in each group
are semantic or syntactical relationships.

In order to avoid putting a premium upon urban culture or amount
of schooling, it was decided to use as worldly referents only the subjects,
verbs, and objects familiar in all western cultures, even the most
primitive. These were felt to be sex differentiation, the young of the
species, and domesticated animals, as far as the nominatives were
concerned, and simple objects like bowls, stools, trees, etc., in
addition to men, women, children, and common animals, for the objects of
transitives. For intransitive verbs the most universal actions were used:
standing, walking, running, jumping, sitting, and the like. For
transitive verbs again the most primitive operations upon objectives were
employed: pushing, dragging, lifting, beating, chasing, leading, etc.
The test is non-verbal to the extent of being administered without
any word in any language being spoken by anyone.
The appearance of validity of the material is not merely superficial,
since the operations required of the examinee are the simpler linguistic
or semantic operations, not just operations thought up for the purpose
of constructing a test. These operations are undoubtedly related to
the operations of reading in any language.

It has been found possible to construct a test of substantial difficulty
which does not seem to offer any reward for visual acuity or pure
visual perception.

Also it seems possible now to produce such a test using such
materials as not to give any advantage whatever to the Northern child
over the Southern, the white over the colored, or the time-server in
school over the bright youngster with less schooling.

This afternoon I will group my comments around four points:
/ 1 ' 1. Whit I' b^ava to bfe AebreticaJ^^n^^ modal, 
v / / or iiorrittelJ foOTdito 

; has davdoped.anittrivedin ■ ' 

2. Some of the blind alleys we have been led into by
   the use of an inadequate and outmoded conceptual model;
3. Some specific research on the topic of bias
   in intelligence tests; and
4. Some considerations which must be taken into account if we are
   to make any real progress toward a meaningful solution of the
   problem of developing "unbiased tests."

About three decades ago, the testing movement sank its tap root
into the fields of education and psychology. This proved to be fertile
soil since these groups, at the time, were first and foremost anxious
to be scientific, objective, and quantitative. The influence of men like
Thorndike and Watson was at high noon. Most of Thorndike's genius
was devoted to activities which involved pioneering in areas where
experimentation and quantification could be applied. Watson's recently
formulated Behaviorism attempted to flush out of American psychology
all of the subjective, conscious cognitive processes, and set up the
dictum that only behaviors which could be observed objectively were
worthy to be legitimate data for the science of psychology.

In these schemes, the subject was considered the equivalent of an
independent, isolated, physicalistic machine; the experimenter
imposed stimuli which in turn elicited responses. This machine could
thus be manipulated and its characteristics studied by the psychologist,
much as the physicist studied physical phenomena in his laboratory.
The task of the scientist was to control (and measure) stimuli and other
environmental conditions, to observe (and measure) responses, and to
determine the relation between the two. The role of the scientist was
that of a deus ex machina, equipped with an inelastic "foot-ruler."
Complex, adaptive behavior was seen in this as conglomerates
of simple elements, whether they were called "neural connections"
or "conditioned reflexes." Such were the psychologist's atoms. Given
enough of them, and in the right proportions, one had what was called,
for example, "intelligence" (cf. 37).

Under the influence of this philosophy of scientific method and
procedure, the task of the intelligence tester was relatively easy: all
he had to do was to present stimuli (items) to the subject, and
determine the adequacy (some measure) of the subject's responses. But
underneath all this was a more fundamental set of ideas, namely those
which characterized the early nineteenth century physical sciences.
These ideas were reflected directly or indirectly in the practices of
the early, and to a large extent the present, testing movement.

Loosely stated, some of the underlying assumptions of this
conceptual scheme which are relevant for our consideration today are:
that the phenomena to be studied were stable; that they were a
"closed system;" that the "closed system" functioned in a manner
analogous to Newtonian formulations of thermodynamic laws; that
the variables used in describing the phenomena were independent,
and quantitative in linear, unidimensional terms; and that these
variables or dimensions of behavior could be measured by the
application of some form of external "foot-ruler," which could be
applied by any trained impartial observer who was removed from, and
was independent of, the phenomena being studied.

I will return to comment on some of these assumptions from time
to time. But first, I want to say that it is too bad, and a little ironic,
that the educators and psychologists who saw their path to scientific
respectability in imitating the physical sciences, imitated a conceptual
framework which was already discarded as inadequate by the very
discipline which developed it. Clerk Maxwell had published in 1877
his Matter and Motion, the work which opened a new era in physical
science theory. But those educators and psychologists whose thinking
and research were patterned after the classical model with which they
were familiar, were intent on being "scientific" at any cost, and the
cost has proved to be high.

Part II



What have been some of the effects on the field of intelligence
testing that resulted, directly or indirectly, from following the theoretical
model of classical physics? There have been several, but I will limit
myself today to three major groupings: the confusion of problems with
techniques; the confusion of facts with artifacts and the generation of
pseudo issues; and the reluctance to consider approaches which
deviate from orthodox theories and techniques.

1. The confusion of problems with techniques

In a lucid discussion of this topic, Maslow (23) has listed several
consequences that result when the techniques or means of investigation
of scientific problems are confused with the problems themselves.
Some of the consequences he lists are: the tendency to lay stress on
elegance, polish, and technique; to over-value quantification,
indiscriminately and as an end in itself; to fit problems to techniques rather
than vice versa; to develop an orthodoxy by those who use the proper
techniques, which in turn tends to block the development of new
methods; to exclude many problems from the jurisdiction of science;
and to make scientists want to be "safe," rather than daring and
creative.

I think that examples of all of these trends or tendencies can be
found in the history of the testing movement. Perhaps it was because
educators and psychologists felt so strongly the need to be "scientific,"
and at the time had precious little else that held out such promise, that
they assumed that this would be a convenient escalator to scientific
status. In any case, they seem to have devoted their energies to the
development of the means or techniques, and have forgotten somewhat
the basic problems they set out to solve. Indeed, in some cases,
they seem to have substituted the means or techniques for the
problems themselves. In making this switch, they were sometimes criticised
that their tests did not measure intellectual potential after all; the reply
has been that their tests did predict school achievement pretty well.
They seem not to have questioned their purposes, but rather to have
justified their techniques. But let us be more specific; let us consider
the kinds of concerns that have preoccupied test constructors, with
occasional illustrative references to our most esteemed test of
intelligence, the 1937 Revision of the Stanford-Binet. I am selecting it,
certainly not because it is more vulnerable to criticism than others,
but because we probably know more about its standardization than
we do about any other test, and because various other tests have used
it as their criterion of "validity."

The primary concerns of most intelligence test constructors can most
likely be summed up in three terms: item-difficulty, reliability, and
validity, and probably in that order of importance. I have picked
these three terms because they have formed the essential justification
for many tests of intelligence. Personally, however, I don't think they
are independent concepts at all, but I will try to speak of them
separately, since they are supposed to be kept separate.
   Let us take item-difficulty to begin. In
   terms of the usual test-construction procedure there is no
   absolute way of establishing true item-difficulty, since the
   difficulty of a given item, or set of items, is in practice defined
   "operationally" by the proportion of children of a given
   chronological age who pass it.
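
The operational definition just given can be written out as a short
sketch. It is only an illustration under stated assumptions: the function
name proportion_passing and the eight-child sample are hypothetical and
do not come from any standardization data cited in this paper.

```python
from collections import defaultdict


def proportion_passing(records):
    """Return {chronological_age: proportion of children passing the item}.

    records is an iterable of (chronological_age, passed) pairs; this
    layout is an assumption made for the illustration.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for age, passed in records:
        totals[age] += 1
        if passed:
            passes[age] += 1
    return {age: passes[age] / totals[age] for age in sorted(totals)}


# Hypothetical standardization sample for one item (assumed data).
sample = [
    (8, True), (8, False), (8, False), (8, False),
    (9, True), (9, True), (9, False), (9, False),
]
difficulty = proportion_passing(sample)
print(difficulty)
```

The figure obtained depends entirely on who is in the sample. It says how
many children of a given age passed the item, and nothing about why the
others failed.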

   This point may be clarified by tracing briefly some of the
   procedures used in test construction and standardization as follows.
   Let us assume that a test constructor finds it necessary to
   establish, first, that a given proportion (e.g., 50 per cent) of the persons
   with a given CA pass an item with a given sigma; and, second, that
   an item on an initial testing is found to be "too easy." The test
   constructor in this case usually makes the item "harder" by either
   rescoring or rewriting it.*

   "In rewriting an item, two procedures are generally used to
   accomplish this desired purpose: either (a) to make the mental
   problem more difficult, so that more mental ability is required to solve
   it, or (b) to retain the same mental problem, but change the form
   of the item so that fewer children at a given age level pass it. In
   the latter case, this is done most easily, and most often, by
   manipulating verbal, etc., factors in the item, usually by using
   more difficult (i.e., esoteric, unusual, or academic) vocabulary in
   presenting the mental problem-to-be-solved. And, because of the
   statistical definition of item-difficulty, it is not possible to tell
   whether the mental problem really was more difficult, or just
   accessible to fewer of the children in the standardization group
   because of the unfamiliarity of the vocabulary or other language
   forms" (17).

It is true that by this procedure the mean mental age is raised for
the item, and that the sigma of the distribution may remain the same.
But this type of standardization procedure leaves several rather basic
questions unanswered. Let us consider two of them. First, we do not
know what happened to the relative position of the individuals
between the first and second distributions. Please bear with me while
I make the following assumption: just suppose that on the initial
* In most tests which use a simple scoring method, the item is generally
rewritten and again tested, or it is discarded. In individual tests where more
complex scoring procedures are possible, rescoring as well as rewriting is used in the
standardization procedure. In the 1937 Revision of the Stanford-Binet, for example,
the tests and scores were revised six times in order to obtain proper distributions
for items on Form L of this test (36, p. 23).

testing for our hypothetical item there was no difference between the
performances of low-status and high-status children, and suppose also
that there was a significant difference on the second testing. The
shuffling of the relative positions of individuals in the second case
would not be apparent at all from the mean and sigma of the second
distribution, if arrived at by the procedure I have described, which is the
usual one. Furthermore, I strongly suspect that in such cases, items
are more often made "harder" than they are made "easier," probably
because the floor of item-difficulty is set by the actual difficulty of the
mental problem-to-be-solved, whereas the apparent ceiling of
item-difficulty can be raised easily by the introduction of such artifacts
as I have suggested.
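
The supposition can be made concrete with a small sketch. Everything in
it is assumed for illustration: the group labels, the pass and fail
scores, and the helper name summarize are hypothetical. It sets two
possible outcomes of a second testing side by side; they share the same
mean and sigma but differ in how the status groups fare.

```python
from statistics import mean, pstdev


def summarize(scores_by_group):
    """Return (overall mean, overall sigma, high-minus-low gap).

    scores_by_group maps the assumed labels "high" and "low" to lists of
    item scores (1 = pass, 0 = fail).
    """
    pooled = [s for scores in scores_by_group.values() for s in scores]
    gap = mean(scores_by_group["high"]) - mean(scores_by_group["low"])
    return mean(pooled), pstdev(pooled), gap


# Two hypothetical outcomes of the second testing (assumed data).
second_a = {"high": [1, 0, 0, 0], "low": [1, 0, 0, 0]}  # no status gap
second_b = {"high": [1, 1, 0, 0], "low": [0, 0, 0, 0]}  # marked status gap

for label, data in (("a", second_a), ("b", second_b)):
    overall_mean, overall_sigma, gap = summarize(data)
    print(label, overall_mean, round(overall_sigma, 3), gap)
```

A constructor who inspects only the overall mean and sigma cannot tell
the two outcomes apart. The status gap appears only when the scores are
broken down by group.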

A second unanswered question shows how difficult it is for me to
keep "item-difficulty" and "validity" separate. It has to do with the
question of

   "whether it is necessary, or even desirable, to confound
   problem-difficulty with vocabulary-difficulty in intelligence test items. One
   reason for believing that these two aspects of an item often are
   confounded is that for many intelligence tests the vocabulary
   score usually has the highest correlation with the total test battery.
   (Terman and Merrill [36, p. 302], for example, cite a set of such
   correlations for single age groups which range from .65 to .91,
   with an average r of .81 for their test.) If problem- and
   vocabulary-difficulty are confounded (whether intentionally or
   unintentionally), it is highly unfortunate in view of the known differences in
   the extent to which children from widely different social classes
   are exposed to the academic language permeating most current
   intelligence tests. It seems apparent that the removal of a
   vocabulary-bias which favors middle-class children does not lower item
   or test validity but indeed increases the validity if one attempts
   to measure problem-solving ability rather than vocabulary" (17).

The question of whether /Vocabulary-bias which favors middla- 
class children" is present in an Item cannot be answered by the statis- 
tician; it can only be determined on the basis of socio-anthropological 
field research^ , . 

Next, reliability.* I have difficulty in understanding clearly what
is meant by reliability, perhaps because there are so many types of it.
But I do know at least that it is almost a rule-of-thumb to say,

"If your test-retest reliability leaves you with a large error showing, try
the split-half method, and then increase reliability a little more
by using the Spearman-Brown formula."

* Besides being used as a measure of the stability of some phenomenon or
behavioral characteristic over a period of time, which I have discussed here, the
term "reliability" is often used in other, and quite different, senses. These include
the homogeneity of the measures of some particular characteristic within a person
or situation which are sampled at a given time, and as the index of the consistency
(or "objectivity") of the scoring of a particular sample of behavior by two or more
persons or methods of evaluation.

I am not entirely facetious in making this statement, because last
year we obtained both split-half and test-retest reliabilities on a
well-known intelligence test. The uncorrected split-half reliability
was .97; the test-retest reliability, after about sixteen months, was
.67. (The number of cases was 68.) I will not mention the particular
test, because I suspect this sort of thing happens with more than just
this one, and besides, the general problem, and not the specific case,
is the important thing for us to consider. In this connection, I would
like to mention Gulliksen's point in Ch. 17 of his book on the Theory
of Mental Tests (16), namely that the use of split-half reliability on
speed tests, where the unanswered items are counted as being
incorrect, yields spuriously high reliability coefficients. The
Stanford-Binet test is not open to this criticism, but several others
are.
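The rule-of-thumb quoted above leans on the Spearman-Brown formula, which estimates the reliability of a full-length test from the correlation between its two halves. A minimal sketch follows, assuming the standard form r_full = k r / (1 + (k - 1) r) with k = 2; the function name is illustrative, and the two coefficients are the ones reported in the text.

```python
def spearman_brown(r_half: float, length_factor: float = 2.0) -> float:
    """Estimate reliability of a test lengthened by length_factor (standard formula)."""
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)


split_half = 0.97   # uncorrected split-half reliability reported in the text
test_retest = 0.67  # test-retest reliability after about sixteen months

corrected = spearman_brown(split_half)
print(round(corrected, 3))                # 0.985
print(round(corrected - test_retest, 3))  # 0.315
```

Applied to the .97 figure, the correction gives roughly .98, so the gap from the sixteen-month retest value of .67 only widens; the two coefficients are evidently answering different questions.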

In the sense that reliability is commonly used, what does it really
mean? It makes sense to me only in terms of the classical physical
science model, one assumption of which was that the object or
phenomenon under investigation remained stable. Thus, if you measured
a lead brick with a given "foot-ruler" at one time, and if you
went back on a later occasion and measured it again, and if your
"foot-ruler" gave you the same reading the second time, you could say
it was reliable; your measuring instrument did not change. You assumed
all along that what you were measuring did not change. But the
problems of measurement that we have to face are not only much more
complex; they are in fact different. In a sense, for us, the
experimenter, the test items, and the subject cannot be clearly
separated. The person who administers and scores the test must be
thought of as part of the "foot-ruler," not just the particular test.

The use of the classical model in attempting [...], for example, does
not fit the facts. If we try to use this model, we would have to say
that many conditions (such as the subject's motivations, past
experiences, attitude toward the examiner, the test situation, and a
host of others) change so drastically "the size and shape" of both
the object of measurement and of the "foot-ruler" that these latter
terms cease to have meaning. Gulliksen referred to this problem when
he said that "a significant contribution to item analysis theory would
be the discovery of item parameters that remained relatively stable as
the item analysis group changed; or the discovery of a law relating
the changes in item parameters to changes in the group" (16, p. 392).

Rather than trying to make the old system of measurement work,
it might be better if we would go back and examine our basic
measurement assumptions, and modify them and our technical procedures
to fit the requirements of the total measurement situation [...] to
be.

Finally, validity. The concept of validity has sometimes been treated
even more casually than "reliability." In looking through Measuring
Intelligence (36), I found a reference to the fact that for the 1937
Revision of the Stanford-Binet, items were selected "that experience
had shown to yield high correlations with acceptable measures of
intelligence" (p. 7). I was unable to find a clear statement of the
"acceptable measures of intelligence." However, Terman and Merrill
(36, p. 9) considered validity to be of primary importance in
selecting test items. Validity, they said, was judged by two criteria:

(1) "Increase in the percents passing from one age (or mental age)
to the next, and (2) a weight based on the ratio of the difference
to the standard error of the difference between the mean age (or
mental age) of subjects passing the test and of subjects failing it.
The use of such a weighting scheme was prompted by the obvious
advantage of being able to utilize the data for all of the subjects
who were tested with a given item."
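The second criterion amounts to what is often called a critical ratio: a difference between two means divided by the standard error of that difference. A minimal sketch follows, assuming independent groups and the usual standard error sqrt(s1^2/n1 + s2^2/n2); the exact computation Terman and Merrill used is not given here, and the ages below are invented purely for illustration.

```python
import math
from statistics import mean, variance


def critical_ratio(passing_ages, failing_ages):
    """Difference in mean age divided by the standard error of that difference."""
    diff = mean(passing_ages) - mean(failing_ages)
    se_diff = math.sqrt(
        variance(passing_ages) / len(passing_ages)
        + variance(failing_ages) / len(failing_ages)
    )
    return diff / se_diff


# Hypothetical ages (in years) of children passing and failing one item.
passing = [8.0, 8.5, 9.0, 9.5, 10.0]
failing = [6.0, 6.5, 7.0, 7.5, 8.0]

print(critical_ratio(passing, failing))  # 4.0
```

A large ratio shows only that children who pass the item are older, on the average, than those who fail it, which is the limitation the authors themselves concede in the passage that follows.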

These authors go on to point out that

"Increase in percents passing at successive chronological ages
is indirect but not conclusive evidence of validity. Height, for
example, increases with age, but is known to be practically
uncorrelated with brightness. Increase in percents passing by mental
age is better, but exclusive reliance upon this technique
predetermines that the scale based upon this criterion will measure
approximately the same functions as that used in selecting the
mental age groups" (p. 10).

This seems a rather scanty justification of validity, in view of the
great expenditure of time and energy that went into the standardization
of this test. McNemar later threw some light on the "validity" of the
Stanford-Binet. He said that "the ultimate criterion of validity was
correlation with mental age or its equivalent in point score on the
composite of the two scales" (25, p. 4). In his Psychological
Statistics, however, he says that "the correlation between two
determinations is . . . termed the reliability coefficient"
(26, p. 128). From such statements it is not clear to me whether a
measure of validity, or a measure of reliability, was used in
standardizing [...].
But let us return to the Stanford-Binet. Since the 1937 Revision was
[...] The Measurement of Intelligence (33). Nowhere did I find any
clear statement of his validating procedure. Terman's assertion that
"the validity of the I.Q. as an expression of a child's intelligence
status follows necessarily from the similar distributions at the
various ages" (33, p. 68) did not entirely satisfy me, and his
statement that "a test which makes a good showing on this criterion of
agreement with the scale as a whole becomes immune to theoretical
criticisms. Whatever it appears to be from mere inspection, it is a
real measure of intelligence" (33, p. 77) left me with the feeling
that Terman was really making a case for "faith validity." However, he
did report the correlation between the I.Q. and teachers' estimates of
the children's intelligence. It was .48, which "is both high enough
and low enough to be significant. That it is moderately high in so far
corroborates the tests. That it is not higher means that either the
teachers or the tests have made a good many mistakes. When the data
were searched for evidence on this point, it was found . . . that the
fault was plainly on the part of the teachers" (33, p. 75). The
correlation between I.Q. and school success was given in another
source (34, pp. 104-6) as being .45. This correlation is not
startlingly high, but in view of the great amount of rather prosaic
rote learning and recitation required in our present public school
curricula, I do not think it would be very flattering to a test which
purports to measure an individual's complex, adaptive, higher mental
abilities if it correlated too highly with school success.

There is another point that I would like to mention with regard
to the standardization of the Stanford-Binet. We are told that in both
the 1916 Revision (33, p. 52) and the 1937 Revision (36, p. 15)
"schools of average social status were selected in each community." As
we know, this means middle-class schools, which are usually attended by
middle-class children. This is a rather serious sampling error, since
research has shown that the concomitants of social class, the range of
experiences, motivations, etc., in turn influence performance on our
present intelligence tests (e.g., 4, 8, 9, 13, 17, 19, 22, 32, 40). It
is analogous to the error that we would make if we wanted to determine
the social and emotional behavior of individuals of all ages up to
maturity, and took an "average" age sample, namely adolescents. It
does not follow that they would be truly representative of either young
children or adults. The use of "average schools" would have been a
good sampling short-cut if the factors which influence intelligence
test performance were randomly distributed along a linear continuum of
social status. Again, [...] answered by the statistician; such
decisions must be based on socio-anthropological field work among
various social-status groups in our society.

A further sampling problem in connection with the 1937 Revision
is seen in the fact that Terman and Merrill selected their
standardization groups from white, native-born children, and on the
basis of census norms for employed males in 1930 (36, p. 14). Even for
this biased criterion, they selected too many children from
high-status, and too few from low-status families. The extent of the
discrepancy for the seven classifications used is indicated by a
chi-square greater than 500. Because of such sources of bias, W. L.
Warner (38), on the basis of his research on the differences in
cultural behavior and experience patterns of various social class
groups, estimates that the Stanford-Binet should, on these grounds, be
appropriate for testing fifty, or perhaps even sixty-five, per cent of
the children in our population.
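A chi-square of this kind compares the number of children actually drawn from each occupational classification with the number the census proportions would lead one to expect. A minimal sketch follows, assuming the ordinary goodness-of-fit statistic, the sum of (O - E)^2 / E over classifications; the seven proportions and counts below are hypothetical and are not Terman and Merrill's figures.

```python
def chi_square_gof(observed, expected_props):
    """Goodness-of-fit chi-square: sum of (O - E)^2 / E over all categories."""
    total = sum(observed)
    return sum(
        (obs - total * prop) ** 2 / (total * prop)
        for obs, prop in zip(observed, expected_props)
    )


# Hypothetical census proportions for seven classifications, highest status first.
census_props = [0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.15]
# Hypothetical sample counts that over-represent the high-status classifications.
sample_counts = [100, 150, 150, 200, 200, 100, 100]

stat = chi_square_gof(sample_counts, census_props)
print(round(stat, 2))  # 108.33, with 7 - 1 = 6 degrees of freedom
```

Even this modest invented imbalance yields a large statistic on six degrees of freedom; a value above 500 indicates a far greater departure from the census criterion.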

2. The confusion of facts with artifacts and the generation of pseudo-issues.

The history of intelligence testing in America has been fraught with
a series of violently contested "issues." I would like to make some
comments about the so-called "constancy of the I.Q.," although I might
also have chosen to speak of the "nature-nurture controversy." When
some people speak of I.Q. constancy, they assume that the person,
because of his genetic inheritance, is born with a given level of
intellectual potential, and for better or worse, it is his for life.
It is even more constant than, say, one's hair, because one can dye,
curl, or even lose his hair, but the I.Q. remains faithful to the
bitter end. Terman expressed this position in speaking of gifted
children thus: "Their high I.Q. is only an index of their
extraordinary cerebral endowment. This endowment is for life. There is
not the remotest probability that any of these children will
deteriorate to the average level of intelligence with the onset of
maturity" (33, pp. 102-3). It is this intellectual "something" that is
said to be measured by intelligence tests.

What is the evidence for this belief, even though many investigators
have found, or some investigators have found many times, that children
often do receive the same I.Q. within rather narrow limits when [...]
to say, when groups are retested after from two to five years, one can
be reasonably sure that their I.Q.'s will vary not more than about
five points in either direction from their previous score. Buttressed
with such findings, it has apparently been [...] being determined by
heredity" ([...], p. 302).

Now, let us look behind these "findings" for a moment, and see how
they were obtained. If we test and retest average middle-class children
from classrooms, we can be almost certain that they come from families
that are stable in the neighborhood, maintain a certain style of life,
and that the over-all parental values, their systems of rewards and
punishments, their expectancies for their children, the type and range
of experiences of the child, the strength of his desire to do well in
[...], and so on, that these conditions will remain about the same from
test to retest. Thus, even if environmental factors do influence
intelligence test performance, we would not be able to observe their
effects under such conditions, since such effects would be held
relatively constant. Some theorists remind me, by their logic, of the
man who thought his thermometer was stuck because it always gave the
same reading, even though he kept his house at the same temperature.
However, in his discussion of occupational differences in obtained
I.Q.'s on the 1937 Revision, McNemar's point is well taken that "a
quarter of a century ago such an accumulation of data as we can here
present would have been hailed as definite proof that intellectual
differences have an hereditary basis, but at the present time these
data will not be regarded as of crucial significance in a field of
controversy" (25, p. 35).
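The thermometer argument can be put in a toy form. Assume, purely for illustration, that an observed score is the sum of a fixed component and a weighted environmental component; the additive model, the weight, and all numbers below are assumptions, not data from any study.

```python
def retest_changes(fixed, env_first, env_second, env_weight):
    """Change in score from test to retest under score = fixed + env_weight * env."""
    first = [f + env_weight * e for f, e in zip(fixed, env_first)]
    second = [f + env_weight * e for f, e in zip(fixed, env_second)]
    return [round(s - t, 6) for t, s in zip(first, second)]


fixed_part = [95.0, 100.0, 105.0, 110.0]  # hypothetical fixed components
stable_env = [1.0, 2.0, 3.0, 4.0]         # same conditions at test and retest
changed_env = [4.0, 2.0, 1.0, 3.0]        # conditions reshuffled at retest

# Environment matters a great deal (weight 5.0) yet is invisible when held constant.
print(retest_changes(fixed_part, stable_env, stable_env, 5.0))   # [0.0, 0.0, 0.0, 0.0]
print(retest_changes(fixed_part, stable_env, changed_env, 5.0))  # [15.0, 0.0, -10.0, -5.0]
```

With an unchanged environment every child's score is constant from test to retest however large the environmental weight, so the observed constancy says nothing about the size of that weight.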

In short, I am suggesting that the "constancy" may have been an
artifact of how we obtained our data, and that unwarranted
generalizations from such data generated an "issue" over which we spent
too much adrenalin, time, and energy. There are other "issues" that
would probably fall into the same class, and social status is one of
them. I think that such controversies would cease to be of central
importance if we knew more about the nature of the phenomena we are
attempting to measure, if we were better able to formulate and
appreciate the relevant parameters that concern what we mean by
"problem-solving ability," and if we could learn ways to cut through
various aspects of the testing situation which are actually irrelevant
to our basic measurement purposes, but which can contaminate our test
items and standardization procedures, and sometimes have.




Earlier in this paper I questioned the long-range value of trying to
be too "scientific" too soon, of setting out to develop tests that are
justified primarily in terms of their statistical characteristics, and
that may correlate with something. I have also stated my belief that
the [...] must concern ourselves. These points will be touched upon
later in this paper.

3. The reluctance to consider approaches which deviate from orthodox
theories and techniques.

This tendency is of importance only in so far as it serves to impede
scientific progress in the development of new ideas and knowledges.
From time to time, new approaches to basic problems, or a questioning
of the established "facts" in a field, meet with rebuke or censorship.
Sometimes they are justified and sometimes not, but this tendency
occurs in every field (cf. 39, ch. 4).

In this connection, I decided last week to check the reviews of
Eells and others' Intelligence and Cultural Differences (13).
I was able to locate ten reviews in general scientific, psychological,
and sociological journals. It was apparent at once that all of the
reviews except those in the field of psychology were either mere
factual reporting of the research and ideas, or very favorable, with
such laudatory statements as "the book might well serve as a model for
social science research" (24, p. 45); or "this very important study
will be of great interest to psychologists as well as to social
scientists, particularly to those concerned with constructing and
giving tests" (30, p. 209). The reviews in psychological journals were
somewhat less enthusiastic.

Could it be that our colleagues in other disciplines, such as
sociology, are not sufficiently familiar with the problems of this
field, or sufficiently knowledgeable to judge adequately the value of
such work? Perhaps. Of the reviews of this book in psychological
journals that I have seen, McNemar's criticisms were the most just,
although his praise for its value was barely audible. McNemar closed
his review (27) with the statement, "Eells, perhaps in tune with his
mentors, concludes that 'variations in opportunity for familiarity
with specific cultural words, objects, or processes required for
answering the test items seem . . . to be the most adequate general
explanation for most of the findings'" (p. 371). I wish McNemar had
seen fit to cite the following paragraph, also on p. 68 of this book
(13), which reads,

"It seems likely that status differences in intelligence-test items
are not due solely to a simple cause but are the result of various
types [...] genetic or developmental ability, on the one hand, and
[...] cultural and [...] in tests, on the other. Interpretation of
I.Q.'s [...] differing cultural backgrounds [...] be made with extreme
caution,"

or the statement on p. [...], which reads,

"[Ano]ther important finding of [the] analysis [...] in this chapter
is the rather substantial number of item[s show]ing large status
differences for which no reasonable explanation can be seen. . . .
The presence of such a large proportion of unexplained differences
should, however, lead to caution in accepting the idea that status
differences on test items can be readily accounted for in terms of
cultural bias of the content."

I point this out because many persons have mistakenly assumed that
the people at the University of Chicago believe that the influence of
heredity is not reflected in intelligence test scores. No one at
Chicago ever said that, but rather that our present tests "measure" a
very great deal besides hereditary potential.
Another instance of reluctance to accept the "Chicago Studies"
came to my attention when I learned that a research report
from there was not accepted by the editor of a well-known
psychological publication. I was a little taken aback to read, among
other things, the editor's comment as follows: "I guess that there's
something that bothers me most, the fact that the [...] of this study
is so grossly different from what is generally believed." This comment,
although perhaps more revealing, is hardly more encouraging to
progress than the one of an editor who allegedly said, "Here is your
paper, somebody wrote on it." However, such a finger-in-the-dike
approach is futile, especially when the main stream of scientific
thought and methodology has long since gone in another direction.



Part III 



It would be impossible to attempt here a survey of the mass of
research findings which bear directly or indirectly on the topic of
bias in our current intelligence tests. Such a survey would have to
draw on materials from such fields, for example, as sociology,
anthropology, psychology, psychoanalysis, and education. It would have
to deal with the host of factors that touch on the broad problem of how
people come to behave the way they do, and why. I limit myself here
to only a token sample of the kinds of findings in the psychological
literature that are obviously relevant to our problem. These include
the role of [...] behavior, especially as they affect intelligence
test performance ([...], 11), the role of learning to learn in
effective [...] (18), and the role of emotional or personality
disturbances, especially as they result in intellectual malfunctioning
[...].

I would, however, like to discuss some research findings which bear
more directly on the problem of bias in intelligence tests. This
experiment was a part of a larger research program that has been going
on at the University of Chicago for seven years, under the leadership
of Allison Davis. The experiment I will discuss was designed to
investigate experimentally some of the many factors which are known to
be culturally determined, and which influence the performance of
children on our present intelligence tests. The factors which, it was
felt, could be studied realistically and controlled experimentally are
formulated in terms of the following experimental conditions: (a)
social-status, (b) practice, (c) motivation, (d) the form of the test
items, and (e) the manner of presentation of the test items (17).

You have already been given a brief description of the major
variables and experimental conditions, the matching variables, the
control variables, and how the data were analyzed (See Appendix).
To save time, I will go directly to a summary of some of the major
findings of this experiment. They are as follows:

1. "The condition of Practice facilitated the gain in performance of
    the high-status children who took the Standard form of the
    Retest, and the gain of the low-status children who took the
    Revised Retest.

2. "The condition of Motivated Practice interfered with the gain
    in performance of both groups of children who took the Standard
    Retest; this was especially true for the high-status children.

3. "The low-status children, when motivated, did significantly better
    on the Standard Retest than the low-status children not thus
    motivated.

4. "Children from both social-status groups made much greater
    gains on the Revised, as opposed to the Standard form of the
    Retest, with the low-status children showing the greater gain.

5. "Some item-types (e.g., analogies, opposites, classification) can be
    revised more easily than others (e.g., syllogisms) to reduce
    'middle-class' bias.

6. "Children from both social-status groups performed better on
    the Revised Initial Test than on the Standard-type Initial Test.

7. "High-status children showed a slightly greater gain when they
    [...]; the low-status children showed an additional gain in
    performance when the Revised Test was also read aloud to them.

8. "The Initial Test and Retest of 40 items were not given under
    strong pressure of time. Many more children from both
    social-status groups passed the test items than one would expect
    from the standardization norms for these items.

9. "Even though the various experimental treatments and conditions
    influenced the retest scores of children in the two social-status
    groups differentially, when the effects of all such treatments
    and conditions were thrown together, there was no significant
    difference between the two groups of children in their ability
    to learn to solve intelligence test problems.

10. "Children from both social-status groups showed greater gain
    in performance when tested on tasks and under conditions which
    were relatively more familiar to them.

11. "The mere revision of the test items was not in itself sufficient
    to reduce the difference in performance between the high-status
    and low-status children. The marked discrepancy between the
    two groups was only decreased when the conditions of Motivation
    and Practice were also present; that is to say, when there
    was also a decrease between the two social-status groups in the
    difference in their familiarity with, and motivation to do well
    on, the test items.

12. "All of the statistically significant differences attributable to
    the conditions of Practice, Motivated Practice, and Motivated
    Retest occurred in connection with the Standard Retest. The
    Revised Retest was not so influenced by such conditions, which are
    essentially irrelevant to the measurement of mental ability, but
    which are determined in large part by the concomitants of
    social-status" (17).

Part IV

The mere existence of this panel on unbiased tests implies some
interest in the measurement of potential (e.g., for abstract reasoning
or problem-solving ability), and some dissatisfaction with tests whose
essential justification is in terms of some criterion of expediency.
Furthermore, by potential, I assume we mean first, an individual's
potential at the time of testing, in terms of his theoretically maximal
ability to perform, rather than some hypothetical, innate, genetic
potential, and second, that the role of social-status is one of the
factors that interfere with our [...].
There are two broad levels on which we may approach the solution
of the problem of possible bias in tests which attempt to measure,
for example, intelligence. One is the theoretical level, or how we
conceptualize and formulate our research problems. The other is the
technical level, or what specific knowledges we should acquire, and
what steps we can and should take in attacking our problems. I will
discuss each of these levels briefly.

The Level of Theory. From time to time this afternoon I have made
the classical physical science conceptual model out to be a scapegoat,
the source of our ills. Some of you may say I have carried this
point too far, and I would agree with you. A more accurate statement
would be that if we had been more articulate about some of the
assumptions we have unwittingly made, we would have ceased to make
them a long time ago. I really think our chief weakness has been that
we assumed we were being "scientists" because we performed some
of the scientific rituals, and we assumed our "facts" were valid
because they were stated in quantitative terms. Others of you may say
that I have been unjustified in spending so much time talking about
vague "theoretical conceptual schemes," that we have work to do, so
let's get to it. The best reply I know to this position was made by
Einstein and Infeld, who said that "the formulation of a problem is
often more essential than its solution, which may be merely a matter
of mathematical or experimental skill. To raise new questions, new
possibilities, to regard old problems from new angles, requires
creative imagination and marks a real advance in science" (14, p. 95).
Frankly, I do not know of any conceptual model which is fully
articulated and appropriate for this field of testing. But I think it
is clear that the system we have been using is quite inadequate and
inappropriate, as I have tried to point out from time to time this
afternoon. Perhaps I can be more explicit by the use of an example
which is admittedly exaggerated. It is a case cited by Boas (5) of "a
certain psychologist who asked a native for the name of his mother. On
receiving the answer 'Whom do you mean?' he marked intelligence as
zero, because the man did not know his own mother. The psychologist
[...] are designated in the native language by a single term, and the
situation did not make it clear that the own mother was meant" (p. 14).
This example differs only in degree, I believe, from many practices
that have existed in the administration and interpretation of
intelligence tests.


In concluding this section I would like to say that since the time of
Lobachevsky in geometry, Boole in algebra, Maxwell in physics, and
Spemann and Weiss in biology, more flexible, general, and useful
theoretical systems have been developed. In looking through recent
issues of the A. A. A. S. publication Science, I found a number of
papers which may give some helpful direction to our thinking in this
field. Some have to do with applications to such fields as physics
(6, 28), and genetics (12). But papers which I believe have rather
direct relevance to some of the problems that we are confronted with
include von Bertalanffy's discussion of the concept of open systems in
biology (3), Bentley's use of the transactional approach in the
general theory of inquiry (2), and a series of three papers by Cantril,
Ames, Hastorf, and Ittelson on psychology and scientific research (7).
Such approaches seem to me worthy of our careful consideration, in the
hope that they may help us better to formulate our problems, and look
for (and perhaps find) more fundamental solutions to them.

The Level of Practice. What do we mean by bias? In a general
sense, whenever we speak of bias, we refer to the influence on our test
scores of factors irrelevant to the purpose of our measurement, and
which can change any of the moments of our score distribution. The
degree of bias and purity of measure, or validity, are inversely
related. Various suggestions have been made for developing unbiased
tests. Binet (4), the pioneer in this field, was the first to point
out that tests of intelligence should be free from the influences of
various knowledges, skills, language usages, and other aptitudes which
result from specific training in the home or school. In attempting to
develop a test which was not contaminated by such experiential
influences, Binet, finally in his [...], [...] such culturally biased
tasks from his test battery.

* I am aware that it is possible to make a case for tests (a) which are justified
in terms of some criterion of expediency such as prediction of school success, or
(b) which are limited in their applicability to only a segment of the population,
such as urban or middle-class groups. One certainly can argue for the existence
of such tests, but in doing so I believe that the test constructor and the test
publisher are obligated to make explicit and public answers to such questions as
the following: Is this sufficient justification for the existence of the "intelligence
test," since previous grade average would probably predict future school success
as well as, or perhaps better than, such a test? Are the consumers of such a test
led to believe that it is a measure of intellectual potential or problem-solving
ability, and so to act upon this belief (cf. 35)? Will such a test be limited in its use
to the group for which it is appropriate, and if not, will the misapplication of the
test serve to perpetuate the current wastage of large reservoirs of intellectual
potential in our society?

In this country, Thorndike and others (37) set forth a few principles
to serve as guideposts. First, they suggested that "intellect is the ability
to learn, and that our estimates of it are or should be estimates of
ability to learn. To be able to learn harder things or to be able to
learn the same thing more quickly would then be the single basis of
evaluation" (p. 17). In terms of constructing intelligence tests, they
suggested that "the wisest procedure at present is to equalize
environmental forces by using a wide variety of data with which all individuals
have had adequate experience" (p. 462). Other suggestions included
the use of novel tasks, "so that no person will have been taught
to do that particular task by environmental forces" (p. 437), and the
use of "tasks that are so familiar that [illegible] somewhat
nearly adequate environmental stimulation to master them" (p. 439).

Recently, Davis (e.g., 8, 9, 13), largely on the basis of extensive
research in this field, has also dealt with this problem at some length,
and has reaffirmed and extended the early position of Binet. Davis (9)
states that

"The crucial problem raised by the attempt to compare scientifically
the capacity of any two individuals to learn is that of finding
situations with which the two individuals have had equal
experience. To state this issue more exactly, two major systems of
behavior are involved in problem-solving. They are (a) the
individual's genetic equipment for problem-solving; and (b) the
individual's particular cultural experience, training, and motivation,
which have developed certain areas of his mental behavior
and certain skills more than others. In a test of general hereditary
capacity the second factor must be equalized for all those tested"
(p. 301).

It will be noted that Davis has repeatedly emphasized that the
condition of "equality" must be expanded to include such considerations as
the manner in which the test is presented to the child, his attitude
toward the testing situation, and his motivation to do well on such
tests, as well as equality of experience in relation to the form and
content of the problems used in the test. In constructing their Test
of General Intelligence, Davis and Eells used test problems which are
"(1) taken from the major areas of children's experience and (2) which
are not likely to have been previously taught in home or school" (10).
How can we approximate the various types of equality necessary
to establish a minimal degree of bias in our tests? To answer this
question, an important first query is the extent to which we can identify
a possible bias variable. Some variables, such as whether a person
is male or female, are easy to identify; others, such as social class or
experiential background, are not so easy to identify. In the former case,
if we want a test which is not biased in favor of one sex, our task is
easy. We can, first, retain only the items which do not show a sex
difference statistically, or second, we can balance our items so that
each sex is favored equally. Or third, we can measure mental processes
which are independent of sex differences. Since the first and third
possibilities differ most sharply from a methodological point of view,
let us consider them. In the case of a variable such as sex, these
possibilities ultimately achieve essentially the same result: the test and
all its items show no bias in favor of one sex over the other.
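
The first of these methods, retaining only items that show no statistically reliable difference between two identified groups, can be sketched in code. The following is a minimal illustration and not the procedure of any study cited here. It assumes each item is summarized by hypothetical pass/fail counts for two groups, applies a chi-square with Yates' correction for continuity to the 2 x 2 table, and takes 3.841 (the .05 critical value for one degree of freedom) as the retention threshold. All function and variable names are invented for the example.

```python
# Illustrative sketch only: names and data layout are assumptions, not taken
# from the studies discussed in the text.

CRITICAL_05_DF1 = 3.841  # chi-square critical value, 1 degree of freedom, .05 level


def yates_chi_square(pass_a, fail_a, pass_b, fail_b):
    """Chi-square for a 2 x 2 table with Yates' correction for continuity."""
    n = pass_a + fail_a + pass_b + fail_b
    row_a = pass_a + fail_a
    row_b = pass_b + fail_b
    col_pass = pass_a + pass_b
    col_fail = fail_a + fail_b
    denominator = row_a * row_b * col_pass * col_fail
    if denominator == 0:
        # A margin is empty, so the item cannot discriminate between groups.
        return 0.0
    difference = abs(pass_a * fail_b - fail_a * pass_b)
    corrected = max(0.0, difference - n / 2)
    return n * corrected ** 2 / denominator


def retain_unbiased_items(item_counts):
    """Keep items whose group difference is not significant at the .05 level.

    item_counts maps an item name to (pass_a, fail_a, pass_b, fail_b).
    """
    return [
        name
        for name, counts in item_counts.items()
        if yates_chi_square(*counts) < CRITICAL_05_DF1
    ]


# Hypothetical counts for two items, 100 children in each group.
example_items = {
    "item_1": (50, 50, 52, 48),  # nearly equal pass rates
    "item_2": (80, 20, 40, 60),  # large difference between groups
}
retained = retain_unbiased_items(example_items)
```

A screen of this kind can be run only against a grouping variable that has been identified and recorded for every child tested.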

It is important to note, however, that this equivalence of final result
does not necessarily hold when we are unable to identify the bias
variable. Under such circumstances we cannot say with confidence that
the first method will give us the same result as the third method. For
example, if the test constructor is unaware of the existence of a variable
such as social class, or has no way of measuring it, or ignores it, his
statistical procedures will provide no safeguard against its entry as a
source of bias into his test. (The same argument would apply to such
possible bias variables as ethnic background or rural [illegible].)

As a matter of fact, as I suggested earlier, such bias has been known to
turn up in the guise of "empirical findings."

I shall conclude my discussion with a consideration of two questions.
First, what do we know about the presence or absence of bias (or
"equivalence" as I have used the term) in our present intelligence tests;
and second, what are some of the considerations that must be taken
into account if we are to minimize bias (or maximize "equality") in
future tests of mental ability?

1. Sources of bias in present tests. There has been remarkably little
basic research in this area, in spite of a recognition of the importance
of the problem.* On the basis of various studies (e.g., 13, 17), however,
we are able to say with some confidence that, by and large,

*(34, p. 135).

a. Present standard tests of intelligence do not really meet
any of the above criteria of "equality," but rather they are
[text missing in source]
and type of previous experience with the content, language
usage, etc., of our present tests, as well as the child's
motivation to do well on them.

b. A very large proportion of the item types characteristically
found in our present intelligence tests cannot be made to
demonstrate "equality" by a verbal face-lifting. The academic
nature of the problems and their content, as well as the manner in
which the problems are presented, are sufficiently artificial to
preclude their use in tests which are unbiased for large
sub-groups in our society.

2. Some considerations for the development of unbiased tests. In
setting out to develop unbiased tests, we can be certain that there is
no simple set of "techniques" or any rule-of-thumb approach to the
problem. Actually, our task is complicated by the fact that there are
certain aspects of our problem about which we can do nothing, but
which are potential sources of bias in our tests. These include the
totality of experiential and cultural heritage which the child brings
to the testing situation, and which may range from possible functional
deficiencies resulting from early nutritional deprivations to specific
training on various types of tasks found in our tests. In any case, it is
safe to assume that "differences of early experience can produce
differences in adult problem-solving that further experience does not
erase" (19, p. 299).

But our task is by no means hopeless, because there are a great many
aspects of our problem that we can do something about. For all
practical purposes, I believe that a good point of departure is to
re-evaluate the following aspects of the testing situation for their possible
contribution to bias in tests of mental ability: (a) the construction and
standardization of the test; (b) the nature of the mental processes
measured, and their relation to effective behavior; (c) the attitudes,
value systems, motivations, etc. of the persons taking the test; and
(d) the manner in which the test is presented. These aspects of the
testing situation cannot, of course, be clearly separated, but for the
sake of this discussion I shall attempt such an artificial division.

a. The construction and standardization of the test. By and large,
we can be certain that the test constructor is a middle-class
individual, being a professional person with college training.
His habits of thought, language usages, etc. reflect his
middle-class culture, and when he writes test items, they too are likely
to reflect this background. The way for him to avoid ivory-tower
item writing is to learn, on the basis of research, as much
as he can about how other sub-cultural groups in our society
live, the words they use, their meanings, etc., and then to write
items which do not favor one sub-group more than another.
Earlier in this paper, in discussing such topics as validity
and item difficulty, I already considered possible sources of
bias that might arise in the standardization of intelligence tests.

b. The nature of the mental processes measured, and their
relation to effective behavior. It is not always clear just what
mental processes are measured by "intelligence tests," or
whether the same processes are measured for various age levels
(cf. 20).

Some research findings bear on this question. In one study
(1), children were asked to give the reasons for their answers
to intelligence test items. In the case of one analogy item, 35
of the 60 children tested marked the "correct" response, but
not one of these children gave the "correct" reason for marking
it. The reasons given were on the basis of rhyming, synonym,
etc., but not on the basis of making the analogy, the process
which the test constructor assumed was being measured.

In another study (11), the test constructor wrote out the
mental processes he thought were being measured by the items
in his published test. It was found that for some items over
fifty per cent of the 152 nine- and ten-year-old children gave
logically defensible reasons for marking answers considered
"incorrect" by the test constructor. Furthermore, whenever
more than one logically defensible answer to an item was given,
the middle-class children tended to give the "correct" answer
(in the opinion of the test constructor), whereas lower-class
children tended to give the "incorrect" answer (in the opinion
of the test constructor).

Perhaps an even more fundamental question has to do with
whether the mental processes purportedly measured in
intelligence tests bear a close relation to intelligent, effective
behavior in life situations. Boas (5) defined the intelligence of
people in terms of "their ability to adapt themselves
adequately to the problems of their life" (p. 11). In a general sense
this appears to be a defensible position. But if test items are
selected primarily in terms of certain statistical criteria,* it is
not certain that such tests will predict intelligent behavior in
this more general sense.

With regard to this problem, Davis and Eells (10) take the
position that,

"In real life, the types of mental problems which the
individual actually meets can seldom be solved by reference
to specific instructions or memorized formulas . . . The
individual has to learn how to organize his own data, to learn
how to define the problem, and to learn how to develop a
method for solving the problems as defined. He is 'on his
own'; he has to find a way to solve the problems."

Consequently,

"An intelligence test should approximate these conditions as
nearly as possible. The test should be designed to measure
what an individual can do in solving mental problems similar
to those which arise in his general experience."
c. The attitudes, value systems, motivations, etc. of the persons
taking the test. It is clear that many groups in our society
appraise the testing situation differently from middle-class
children, and may feel intimidated by, or not motivated to do well
on, our present intelligence tests. It is also clear that such
"non-intellective" factors as rapport, attitude, and motivation
substantially influence performance on intelligence tests (e.g., 4,
8, 15, 1[?], 21, 29, 40). Since the child's prior attitudes and
motivations cannot be changed at the time of testing, then
perhaps the testing situation can be made minimally threatening
and maximally motivating to all children. This would
necessitate both the construction of tests which are, in
themselves, maximally interesting and motivating to all children,
and the creation of a favorable "atmosphere" in which to give
the tests. To use problems which are meaningful to all the
children being tested would also serve to stimulate them to
use their problem-solving ability to a maximum in the testing
situation.

*Terman (33) states that he eliminated certain tests from his battery "which
have been considered excellent," because they "proved to be so little correlated
with intelligence that they had to be discarded" (p. 56). He defined intelligence
here in terms of the score achieved on the total scale.

d. The manner in which the test is presented. If we are to
develop unbiased tests of intelligence or problem-solving ability,
it seems clear that the tests should be so presented that all the
children tested have an opportunity to understand equally the
problems to be solved, so that they can utilize their
problem-solving ability when taking the test.

One aspect of this question has to do with the academic
vocabulary often used in presenting the test items. In a study
based on [illegible] cases (32), it was found that when the words used
in standard intelligence tests were made into vocabulary tests,
from two-thirds to three-fourths of the terms were better known
(P = .05) by middle-class than by lower-class children. Such
a source of bias could easily be removed by presenting the
test problems in terms which are equal in familiarity and
meaning to all children taking the test.

Another important aspect of test presentation has to do with
the emphasis placed on speed in many of our intelligence tests.
In this connection it has been pointed out that

"Speed is influenced both by cultural attitudes concerning
the importance or unimportance of speed, and also by
personality and motivational factors, such as competitiveness,
conscientiousness, compulsiveness, exhibitionism, and
anxiety" (10).

It was found in an experiment reported earlier (17) that many
more children from both high- and low-status groups passed
the test items than one would have expected from the
standardization norms. The only possible explanation for this finding
seems to be that only forty items were given in the testing
period of fifty minutes. This allowed the children to pass many
items they would not have been able to pass under speeded
test conditions. It was also found in this experiment that when
the test items were read orally to lower-class children while
they followed in their test booklets, they passed appreciably
more of the items (P = .07) than matched groups of children
who took the test in the traditional (silent) manner (see
Appendix). Such findings suggest that a wide range of conditions
exist in our present tests and testing procedures which serve to
introduce bias into our present measures of intelligence.

Finally, the emphasis on speed (rather than power) in the
measurement of intelligence actually results in the confounding
of such factors as reading speed, previous familiarity with
the content of the test items, and rote or incidental memory
with problem-solving ability. In attempting to develop
unbiased tests, it seems desirable to remove such sources of bias
from intelligence test scores, especially since previous
experience with test-type materials is enjoyed differentially by
various groups in our society, and the correlation between
incidental memory and problem-solving ability is negligible, if
indeed these two variables are not negatively related (cf. 31).

SUMMARY

"The standard-type intelligence tests are inadequate on several
counts. Among other things, (a) they have measured only a very narrow
range of mental abilities, namely those related to verbal or academic
success, and have ignored many other abilities and problem-solving
skills which are perhaps more important for adjustment and success,
even in middle-class society; (b) they have failed to provide measures
of the wide variety of qualitative differences in the modes or processes
of solving mental problems; (c) they have ignored the influences of
differences in cultural training and socialization on the repertoire of
experience and the attitude, motivation, and personality patterns of
sub-groups in our society, and the effect of such factors on mental test
performance; and (d) they have considered mental functioning in
isolation, thus ignoring the interdependence of the individual's motivational
and personality structure on the characteristics of his mental
functioning, as seen, for example, in the differences between rote learning and
the ability to use previous experiences creatively in new contexts.

"A re-evaluation of the purposes and problems involved in the
appraisal and description of mental abilities is necessary before adequate
mental tests can be developed. But before this can be done, it will
first be necessary to conduct anthropological, sociological, and
psychological studies to learn how representative children in our society live.
For lower-class and ethnic children, for example, information is needed
concerning their value, attitude, and motivational systems, the nature
of their daily experiences, and the range of mental behaviors and
modes of thinking used in finding solutions to their life problems. It
will also be necessary to consider the growing body of evidence that
mental functioning does not exist in a vacuum, but that the individual's
motivational and personality structure, his attitudes, interests, needs,
and goals are intimately related to, and in a large measure determine,
his mental processes" (17).



REFERENCES 



1. Ataullah, Kaniz. Cultural influence on children's solution of verbal problems: A qualitative study. Unpublished Ph.D. dissertation, University of Chicago, 1950.

2. Bentley, A. F. Kennetic inquiry. Science, 1950, 112, 775-783.

3. von Bertalanffy, L. The theory of open systems in physics and biology.

4. Binet, A. & Simon, Th. The development of intelligence in children (Tr. by Elizabeth S. Kite). Baltimore: Williams & Wilkins, 1916.

5. Boas, F. Evidence on the nature of intelligence furnished by anthropology and ethnology. Addresses and discussions presenting The Thirty-Ninth Yearbook, Intelligence: Its nature and nurture, NSSE (Ed. by G. M. Whipple). Salem, Mass.: Newcomb and Gauss, 1940.

6. Bohr, N. On the notions of causality and complementarity. Science, 1950, 111, 51-[illegible].

7. Cantril, H., Ames, A., Jr., Hastorf, A. H., & Ittelson, W. H. Psychology and scientific research: I. The nature of scientific inquiry; II. Scientific inquiry and scientific method; III. The transactional view in psychological research. Science, 1949, 110, 461-464, 491-497, 517-522.

8. Davis, A. Social class influences upon learning (The Inglis Lecture, 1948). Cambridge, Mass.: Harvard Univ. Press, 1948.

9. Davis, W. A. & Havighurst, R. J. The measurement of mental systems. Sci. Monthly, 1948, 66, 301-416.

10. Davis, A. & Eells, K. Manual for the Davis-Eells test of general intelligence. Yonkers-on-Hudson, N. Y.: World Book Co. (In preparation.)

11. Davis, A., Eells, K. & Bergman, D. Reasoning processes underlying pupil's choices on a standard test of intelligence. Unpubl. manuscript, Dept. of Education, Univ. of Chicago, 1950.

12. Dobzhansky, Th. Heredity, environment, and evolution. Science, 1950, 111, 161-166.

13. Eells, K., Davis, A., Havighurst, R. J., Herrick, V. E., & Tyler, R. Intelligence and cultural differences. Chicago: Univ. of Chicago Press, 1951.

14. Einstein, A. & Infeld, L. The evolution of physics. New York: Simon & Schuster, 1942.

15. Gordon, L. V. & Durea, M. A. The effect of discouragement on the revised Stanford-Binet scale. J. genet. Psychol., 1948, 73, 201-207.

16. Gulliksen, H. Theory of mental tests. New York: Wiley & Sons, 1950.

17. Haggard, Ernest A. Social-status and intelligence: An experimental study of certain cultural determinants of measured intelligence. Genet. Psychol. Monogr. (In Press, May, 1954.)

18. Harlow, H. F. The formation of learning sets. Psychol. Rev., 1949, 56, 51-65.

19. Hebb, D. The organization of behavior. New York: Wiley & Sons, 1949.

20. Jones, L. V. A factor analysis of the Stanford-Binet at four age levels. Psychometrika, 1949, 14, 299-331.

21. Lantz, B. Some dynamic aspects of success and failure. Psychol. Monogr., 1945, 59, 6-21.

22. Lorge, I. Schooling makes a difference. Teach. Coll. Rec., 1945, 46, 483-492.

23. Maslow, A. H. Problem-centering vs. means-centering in science. Philos. Sci., 13, 326-331.

24. McGee, J. W. Review of Eells, et al. (See reference 13.) Am. Cath. soc. Rev., 1952, 13, 45.

25. McNemar, Q. The revision of the Stanford-Binet scale. Boston: Houghton

26. McNemar, Q. Psychological statistics. New York: Wiley & Sons, 1949.

27. McNemar, Q. Review of Eells, et al. (See reference 13.) Psychol. Bull., 1952, 49, 370-371.


28. Rothstein, J. Information, measurement, and quantum mechanics. Science, 1951, 114, 171-175.

29. Sacks, Elinor. Intelligence scores as a function of experimentally established social relationships between child and examiner. J. abn. soc. Psychol., 1952, 47 (No. 2, Suppl.), 354-358.

30. Sargent, S. S. Review of Eells, et al. (See reference 13.) Amer. J. Soc., 1952, 58, 209-210.

31. Saugstad, P. Incidental memory and problem-solving. Psychol. Rev., 1952, 59, [pages illegible].

32. Stone, R. Certain verbal factors in the intelligence-test performance of high and low social status groups. Unpubl. Ph.D. dissertation, Univ. of Chicago, 1940.

33. Terman, L. M. The measurement of intelligence. Boston: Houghton Mifflin Co., 1916.

34. Terman, L. M., et al. The Stanford revision and extension of the Binet-Simon scale for measuring intelligence. Baltimore: Warwick & York, 1917.

35. Terman, L. M., Dickson, V. E., Sutherland, A. H., Franzen, R. H., Tupper, C. R., & Fernald, Grace. Intelligence tests and school reorganization. Yonkers-on-Hudson, N. Y.: World Book Co., 1923.

36. Terman, L. M., & Merrill, Maud A. Measuring intelligence. Boston: Houghton Mifflin Co., 1937.

37. Thorndike, E. L., Bregman, E. O., Cobb, M. V., & Woodyard, Ella. The measurement of intelligence. New York: Bureau of Publications, Teachers College, Columbia University, 1927.

38. Warner, W. L. (Personal communication.)

39. Watson, D. L. Scientists are human. London: Watts & Co., 1938.

40. Weisskopf, Edith A. Intellectual malfunctioning and personality. J. abn. soc. Psychol., 1951, 46, 410-423.






APPENDIX

From: SOCIAL-STATUS AND INTELLIGENCE

An Experimental Study of Certain Cultural Determinants
of Measured Intelligence

Ernest A. Haggard
(Genetic Psychology Monographs, In Press)

May, 1954

I. Major Variables and Experimental Conditions

Social status: On the basis of ISC* scores, [illegible]
approximately the top 14 and bottom 14 per cent of [illegible]
in a Midwestern city of 115,000. They are designated "high-status"
[illegible].

Practice: Fifty-minute periods of practice for three [illegible]
solving test problems (items) [illegible]
the Retest. All children receiving practice finished all items in the work
books provided during each of the three practice sessions.

Motivation: Promise of a free theater pass, or its equivalent in money, if
the child did "his best" during the Practice or Retest sessions. [Illegible]
were given this "reward" at the end of the Practice and/or
Retest sessions.

Form of Test Items: There were two parallel forms of 40 items each: the
Standard, taken from [illegible] intelligence tests; and the Revised,
which were rewritten as, for example:

Standard: Cub is to bear as gosling is to
1 ( ) fox. 2 ( ) grouse. 3 ( ) goose. 4 ( ) rabbit. 5 ( ) duck.

Revised: Puppy goes with dog like kitten goes with
[options 1 to 3 illegible] 4 ( ) rabbit. 5 ( ) duck.

Selection of test items: Fourteen months prior to the present
experiment 2,298 nine- and ten-year-old children were given a battery of
[illegible] intelligence tests, and 2,510 thirteen- and fourteen-year-old children
were given a battery of four intelligence tests. The ages of the children
in the present experiment averaged 11 years 2.57 months and 11 years
2.78 months for the high- and low-status groups respectively.

Four criteria determined the selection of the 40 items used in this
experiment, namely that (a) the content of the items be meaningful
to the children of both social classes, and (b) that these items
could be revised without changing the basic meaning or the difficulty
of the mental task involved. In addition, (c) items were selected which, on
the basis of the previous testing, were passed more often (i.e., P < .01)
by the high-status than by the low-status children of the same age. In
order to have items of suitable difficulty for the 11-year-olds in this study,
items were selected which were (d) failed by most of the younger
children and passed by most of the older children in the previous testing.
Approximately two-thirds of the 40 items were taken from tests given to
the older age group in the previous testing.

Presentation of Items on the Revised Retest: Most of the Revised Retests
were taken in the traditional manner, but two groups had the Revised Retest
read orally by the teacher while the children followed in their test booklets.

*An index giving equal weight to: parental education, parental income, house
type, and dwelling area.

II. The Experimental Design given below represents the breakdown for one
social-status group. The design is identical for both social-status groups.

[Table: experimental design. Day 1: Initial Test. Days 2-4: Practice Periods
(50 min. per day). Day 5: Retest (Standard, or Revised only). The final two
columns give the number of children in the high- and low-status groups for
each sub-group, with totals of 339 and 332. The row layout is not
recoverable from the source; the cell counts survive only as an unordered
run: 28, 28, 32, 28, 21, 21, 17, 10, 20, 24, [illegible], 18, 19, 25, 21[?],
19, 22, 26, 26, 23, 20, 20, 20, 25, 20, 35, 39.]



III. Matching of Subjects: Within each social-status group, subjects were matched
on: (a) ISC, (b) age to the nearest month, (c) grade in school, and (d)
Kuhlmann-Anderson I.Q. The means for each of the 14 high-status and 14
low-status sub-groups deviated not more than one standard error from the mean
of their respective total social-status groups on any one of these four variables.
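
The matching criterion can be expressed as a simple check. The sketch below rests on an assumption the source does not settle: it takes "one standard error" to mean the standard deviation of the total status group divided by the square root of the sub-group size. The function name and the sample values are hypothetical.

```python
import math
import statistics


def within_one_standard_error(subgroup_values, total_group_values):
    """Return True if the sub-group mean lies within one standard error
    of the total-group mean.

    Assumption: the standard error is the total-group standard deviation
    divided by the square root of the sub-group size.
    """
    standard_error = statistics.stdev(total_group_values) / math.sqrt(
        len(subgroup_values)
    )
    deviation = abs(
        statistics.mean(subgroup_values) - statistics.mean(total_group_values)
    )
    return deviation <= standard_error


# Hypothetical ages in months for a total status group and two sub-groups.
total_group = list(range(100, 140))
well_matched = [110, 120, 130, 118]
poorly_matched = [100, 101, 102, 103]
```

The same check would be repeated for each of the four matching variables and for every sub-group.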

IV. Control Variables: The following data were collected for each child and
utilized in the general statistical analysis: the child's (a) ISC, (b) age, (c) grade,
(d) I.Q., (e) sex, (f) school, (g) teacher of the practice period, (h) score on the
Initial Test; the presence or absence of (i) Practice, (j) Motivated Practice,
(k) Motivated Retest; whether the retest was (l) Standard or Revised form,
and if Revised, (m) whether it was administered silently or orally; (n) the
score on the Retest; and (o) the gain in performance as indicated by the
difference between the transformed Initial Test and the transformed Retest
scores.

V. Scores Used in General Statistical Analysis: The difference between the
transformed (arc sine) Initial Test and Retest scores.
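
The gain score described here can be illustrated as follows. The source does not give the exact form of the arc sine transformation, so this sketch assumes the common angular transformation, the arc sine of the square root of the proportion of items passed, applied to scores on the 40-item forms. The function names are hypothetical.

```python
import math

ITEMS_PER_FORM = 40  # each form of the test had 40 items


def arcsine_transform(items_passed, total_items=ITEMS_PER_FORM):
    """Angular transformation of a proportion (assumed form: asin(sqrt(p)))."""
    proportion = items_passed / total_items
    return math.asin(math.sqrt(proportion))


def gain_score(initial_passed, retest_passed, total_items=ITEMS_PER_FORM):
    """Difference between the transformed Retest and Initial Test scores."""
    return arcsine_transform(retest_passed, total_items) - arcsine_transform(
        initial_passed, total_items
    )


# The same raw gain of four items at two different starting levels.
gain_near_floor = gain_score(2, 6)
gain_mid_scale = gain_score(18, 22)
```

Under this assumed transformation the ends of the scale are stretched, so a gain of four items near the floor yields a larger transformed gain than the same raw gain in the middle of the scale.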


[Illegible] Neyman method of testing linear hypotheses was used to test the
possible effects of variables or experimental conditions. For item analyses,
Chi-square (with Yates' correction for continuity) [illegible]. For [illegible]
the degree of relationship between variables, the [illegible] correlation was used.

Techniques for the Development of
Unbiased Tests



DISCUSSION OF PAPERS



Editor's Note: As Dr. McNemar had not had an
opportunity to read Dr. Haggard's speech prior to the
Invitational Conference, the following material was prepared
after the conference. In addition it was agreed that Dr.
Haggard would be allowed to prepare a reply to Dr.
McNemar for inclusion in the Proceedings.

Professor Lorge has successfully anticipated Dr. Haggard's general
thesis and provided us with such an excellent critical evaluation
thereof that little is left for me to say except that I am in full
agreement with all the points made by Lorge.

Dr. Haggard's rather over-lengthy presentation contains many
matters of a specific nature which I would like to question, but time permits
me to consider only a few points. Indeed, an adequate assessment of
parts of his paper must await the publication of a number of researches
which he cites.

First, I would like to set the record straight regarding my supposed
failure to differentiate between the concepts of reliability and validity.
Dr. Haggard gives two quotations, from two of my publications,
which seem so inconsistent as to make "it not clear whether a measure
of validity, or a measure of reliability, was used in standardizing" the
1937 Stanford-Binet. Perhaps my supposed inconsistency can be
removed by merely pointing out that the first quotation happens to be
from an introductory chapter by Terman; certainly Terman did not
need to agree with something I was to write years later!

I find it difficult to share Haggard's alarm about our psychological
measurement being modeled after classical physics. In fact, I would
be quite happy if our schemes for measuring behavior could reach
the [illegible] of commonplace measurement attained by the classical
physicists.

One may question the clarity of parts of Haggard's discussion. For
example, what does it mean to say that item difficulty, reliability, and
validity "have formed the essential justification for many tests of
intelligence?" And where did he get the idea that these three concepts
"are supposed to be kept separate?" He didn't find this strawman in
[illegible] treatise by Gulliksen (2). Nor will our speaker find in
Gulliksen or any other modern source the notion that test-retest with
[illegible] is an acceptable way for determining
[illegible].

[Illegible] that the "concept of validity has sometimes been
treated even more casually than reliability," and as a first bit of
evidence he gives a quotation from page [illegible] of Terman and Merrill (3)
which presumably tells us how items were selected for the 1937
Stanford-Binet. Actually, the complete sentence from which the quotation
was lifted speaks of how "types of test items" were selected, quite a
different thing. Next he gives further quotations regarding item
selection for the 1937 Stanford-Binet, but never hints that these quotations
are from a section dealing with the preliminary selection of items.
Since Terman readily admits that the 1937 scale measures essentially
what was measured by the 1916 scale, Haggard asks for evidence
regarding the validity of the latter, his own search having conveniently
ignored the literature between 1917 and 1937.

As to his discussion of the question of I.Q. constancy, I can only
remark that Haggard attributes a far greater degree of constancy than
test-retest facts warrant. It is of course convenient for his thesis to
have constancy for the I.Q.; otherwise he would have to explain how
continuation in the same social status level and continuing in
"middle-class" schools could lead to changes in the I.Q.

The discussion of the reviews of Eells' book I find very amusing,
especially since it purports to show how psychologists are reluctant to
consider approaches which deviate from the orthodox while
sociologists are more willing to accept the new. This deduction is arrived
at by a strange type of logic: Eells' book received more favorable
reviews in the sociological than in the psychological journals, ergo,
Q.E.D. But this absurdity becomes ludicrous when it is noted that one
of the two cited "sociological" reviews was by a psychologist (S. S.
Sargent)! Further "reluctance" on the part of psychologists to accept
the new (and thereby get into "the main stream of scientific thought")
is cited by Haggard: an editor of a well-known psychological
publication would not accept a research report from the Chicago group. Now
I don't know the merits of this case but perhaps the editor was aware
of the Bernardine Schmidt fiasco.

After an hour (or some 28 typescript pages) our speaker finally came
to the question posed for discussion by this panel. No doubt some of
you listened in vain for the detailed steps by which he proposes to
develop unbiased tests. I found myself woefully confused at this
juncture. Earlier in his paper he criticized Terman for standardizing
the Stanford-Binet on children from schools of average social status
because that means "middle-class" schools, but now we learn that our
"test constructor is a middle-class individual, being a professional
person with college training." This equating of average social status with
the college educated errs as much in the direction of imprecision as
the concept of "six clearly marked social classes" (Eells et al., 1, p. 17)
errs in the direction of pseudo precision.

Unfortunately Haggard's discussion does not permit one to evaluate
the Chicago methods for developing unbiased tests; it is to be hoped
that the cited forthcoming publications will provide the necessary
detail for an appraisal. It will be interesting to learn the extent to
which, and how well, these investigators are doing something different.
Some of us will wish to know whether any of their methods lead to
the elimination of the portion of the variance in test scores due to
possible hereditary differences.

I have a few remarks to make on Dr. Rulon's paper. This test which
he has devised is indeed very ingenious. I am sure that the
feeble-minded youngster will be glad to be tested by somebody who can,
when giving directions, reach down to his level without the use of
even the simplest words, thereby avoiding completely the vocabulary
worries of our Chicago friends.

Rulon speaks of face validity, and to this I have no particular
objection provided the notion isn't carried to the point where we delude
ourselves. As I analyze this test, it seems to me that it involves a
learning situation but at a higher conceptual level than the usual
substitution type of stunt. Since this test is obviously a learning
task, one must raise the question as to how general is the learning
ability being tapped; the factor analysts may need to step in with an
answer to this.

There is still another difficulty which Rulon must face. The learning
theorist can ask whether performances on this learning task might
be subject to transfer effects which are differential from person to
person. Then the cultural protagonists can say that individuals in
different cultures or in different social status levels will have learned
different things, hence by way of possible transfer the influence of
cultural differences may contribute to score variance.


Rulon claims that objects, actions, etc., required for this test are
almost universal. Now I don't know whether the kids on the lower
east side of New York are familiar with cows; I rather doubt it.
The hand-out illustration did not include the cow, but another
illustration which I saw did. If the cow is used, I hope there is no
sitting cow involved!

Although Rulon is properly cautious, he says that he thinks this
test does not involve either visual acuity or perceptual ability. Those
of us who have examined the illustrative material may think otherwise.
I suspect that differences in perception may enter into performance on
this test. As a positive suggestion for further eliminating visual and
perceptual factors I suggest that in revising this scale he have the
woman drawn by Peter Arno!

Techniques for the Development of
Unbiased Tests

ERNEST A. HAGGARD



REPLY TO DR. McNEMAR'S REMARKS

Prefatory Note: It was understood when I accepted the invitation to
appear on this Panel that the discussion was to be on a general
theoretical level. In preparing my paper, I became more interested in
the problem of bias in measuring intelligence than in confining my
remarks to a time limit. Consequently, at the Conference, only part
of the material was presented. Also, since I was asked to join the
Panel relatively late, it was not possible to give Professor McNemar
a copy of the paper before the Conference. Thus, it was later agreed
that he be given an opportunity after the Conference to criticize my
remarks. Professor McNemar's criticisms are based on his study of my
paper over a six-week period. In replying to his criticisms, I will refer
directly to them, paragraph by paragraph.

1. Regarding Professor Lorge's paper, I think he and I have
approached the problem of "bias" in somewhat different manner, as is
apparent from our papers. But in reacting to his comments, I would
like to point out that a careful reading of Eells, et al. (2) will show
that it is primarily a report of research investigating some of our
present tests. The development of new intelligence tests is reported
elsewhere (1). Also, in terms of Lorge's closing remarks, the ultimate
purpose of the work by Davis and others at Chicago was to develop tests
which "allow all in our democracy to have an equal opportunity for
maximum development of their potentialities" because "some kinds of
bias" have been removed from intelligence tests.

2. I had hoped that the time (six weeks) would permit McNemar to
consider also some of the general methodological points I raised,
especially since they are fundamentally more important in dealing with
the problem of bias than the points he chose to discuss.

3. At the end of this paragraph, McNemar is correct in checking
me up on the fact that Terman wrote Chapter I of his book (3). But
my point was that this test was "validated" in a circular manner (and,
indeed, it was rather a small circle) and that consequently this seems
to me more like a measure of reliability than of validity. (See
paragraph 6 below.)

4. I am not alarmed, but rather believe that in the "measurement"
of mental processes, the phenomena to be measured and the available
means of "measuring" them differ from those of classical physics. Along
with McNemar, I too "would be quite happy if our scheme for measuring
behavior could reach the level of commonplace measurement attained
by the classical physicists." If, however, "intelligence" somehow
could be directly observed, and if we had scales which possess certain
characteristics (e.g., equal units) independent of the phenomena being
measured, we too could begin to make measurement statements in
the manner of the classical physicist. But this is not the case, nor will
wishing make it so.

5. I was only confessing my inability to see any other scientifically
justifiable raison d'etre (except for, say, prestige or monetary reasons)
for some intelligence tests. No, I did not find in Gulliksen the
straw-man idea that these concepts "are supposed to be kept separate."
But this supposition is fairly common to our thinking, and McNemar's
too, as seen in his desire "to set the record straight regarding my
supposed failure to differentiate between the concepts of reliability
and validity" (paragraph 3 above). Also, in many texts in this area, one
finds such statements as "the familiar distinction between the
'reliability' of a test and its 'validity'" (5, 106). But my real point
had to do with the inappropriateness of our conceptualization of our
measurement problems, and hence the inappropriateness of various
concepts or techniques that go along with, or fit, the conceptual model
we use. Finally, I imagine that the reason test-retest reliabilities,
with an interval of 16 months, are not "acceptable" is probably because
they are generally too low to be of practicable value.*
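The footnote to this paragraph rests on a property of the correlation coefficient that can be checked numerically: adding the same constant to every retest score (uniform growth in mental age) leaves the test-retest correlation unchanged. The Python sketch below illustrates this with invented scores. The numbers and the helper name `pearson_r` are assumptions made for illustration and do not come from the Conference papers.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical mental-age scores in months (invented for illustration).
first = [72, 85, 90, 96, 104, 111]
wobble = [3, -2, 4, -1, 0, 2]  # individual ups and downs at retest

# Retest with no growth at all, and retest 16 months later assuming
# every child gains exactly 16 months of mental age.
retest_same_mean = [score + w for score, w in zip(first, wobble)]
retest_grown = [score + 16 for score in retest_same_mean]

r_same = pearson_r(first, retest_same_mean)
r_grown = pearson_r(first, retest_grown)
print(r_same, r_grown)
```

The two coefficients agree although the second retest mean is 16 months higher. Whether growth is in fact uniform from child to child is the empirical question; the sketch only shows that a shift in the mean, by itself, does not lower the coefficient.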

* Actually, I see no theoretical reason why such a measure would not be
acceptable, since mental age is presumed to grow at a rather steady rate,
and since the correlation coefficient does not reflect differences
between distribution means but only relative position within the two
distributions.

6. There are several indications of this. In McNemar's book on the
revision (3), he gives one chapter (VI) to a discussion of reliability,
and, by his Index, pages 82-3 to validity, where, by the way, he says
essentially what Terman said in Chapter I of his book. Furthermore, I
do not believe that the selections I cited from Measuring Intelligence
(4, 7-10) do violence in describing Terman's procedure. Item types
which are eliminated in the preliminary screening certainly do not
appear in the final test, and from what I can gather, the procedure
quoted from pages 9-10 is the same as the one used in deriving the
final scales (4, 21-23).
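The procedure Terman describes for the final scales, correlating each test with the composite total and eliminating the least valid, corresponds to what is now usually called an item-total correlation. A minimal Python sketch follows, using invented pass/fail data. The data, the function names, and the choice to leave each item inside its own total are assumptions for illustration, not a reconstruction of the 1937 computations.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def item_total_correlations(responses):
    """Correlate each item (column) with the composite total score.

    `responses` holds one row per subject; each entry is 1 for a pass
    and 0 for a failure on that item.
    """
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [
        pearson_r([row[j] for row in responses], totals)
        for j in range(n_items)
    ]


# Hypothetical responses of six subjects to four items (invented).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

correlations = item_total_correlations(responses)
# Index of the item with the weakest relation to the composite, i.e.
# the first candidate for elimination under this rule.
least_valid = min(range(len(correlations)), key=correlations.__getitem__)
print(correlations, least_valid)
```

Because each item contributes to the total it is correlated with, the criterion in this sketch is internal to the test itself, which is the circularity referred to in paragraph 3 above.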

Incidentally, the purpose of the Panel was not to analyze the
Stanford-Binet or review the literature, except as it pertains to the
development of unbiased tests. But I want to take this opportunity to
say that the 1937 Revision, if used wisely and skillfully, is,
pragmatically speaking, a very flexible and useful measuring and
diagnostic instrument. I was certainly not advocating that it be
discarded; I was talking about the relation of standardization
procedures to possible sources of bias in measuring such phenomena as we
call "intelligence."

7. I was quoting the work of Terman and others (cf. 5, 185), and
did not argue for constancy of the I.Q. I did say, however, that more
than social status level influences performance on intelligence tests.

8. I think that the tenor of McNemar's criticisms belies his
amusement.

9. Now, McNemar knows that I confined my presentation to my
allotted time, 30 minutes, and that the purpose of the Panel was a
theoretical discussion of the problem of bias in tests. And as for my
remark "by and large, we can be certain that the test constructor is a
middle-class individual, being a professional person with college
training," I was referring to a report of three studies which found that
between 97[?] and 100 per cent of the public school teachers studied
hold the values of middle-class higher social status groups (6, Ch.
VIII). On the basis of such findings, I did not think that my
generalization was unfair to the test constructors.

10. In reviewing McNemar's remarks about my paper, I am
disappointed that his criticisms were not on a higher level, and that he
failed to deal with some of the more fundamental issues raised. In
view of the amount of time he had to "work over" my paper, I had
hoped he would do more than concern himself with matters of wording
and minor disagreements over incidental details. I trust that when
McNemar "evaluates" and "appraises" forthcoming work in this area,
he will use his abilities for the clarification of basic issues, and will do
so in a manner in keeping with his stature in the field.

† For example, Terman says that "it was then possible to plot for each
test the curve showing per cent of subjects passing in successive ages
throughout the range. . . . The correlation of each test with composite
total (equivalent to correlation with mental age) was computed
separately for each test, thus providing a basis for the elimination of
the least valid tests" (4, 22).

PARTICIPANTS

Richard H. Gaylord, Ernest A. Haggard, Phillip J. Rulon,
John W. Tukey
Dr. Gaylord: It seems to me we have two points of view in building
these tests. One is that you sit down and build a test. It is going
to be reasonably homogeneous in content. You then find out all the
things that that kind of content is related to. I think, on the other
hand, we have had the point of view that you take a lot of reference
variables and find a test that is related to them in a predefined
fashion. Those two positions are not compatible; you can't mix the two
and come out with the same thing. I think each has its place.

Question: I should like to hear Dr. Rulon defend himself on the
question of the validity of his test.

Dr. Rulon: I don't remember having claimed any particular validity
for the test.

I described this test to the Department of Psychology at Yale, and
we had time for questions, so much time that I regret we do not have
that situation today. But I thought the best question asked of me
was the following: "Doctor, what are you going to say about the
southern colored boy who doesn't do very well on your test?"

I said I would answer the question if the questioner would take my
answer seriously. I didn't want to be accused of joking. The answer is,
I shall say the child doesn't seem to be very good at this sort of thing.

Dr. Tukey: There are two or three questions I would like to raise.
First, I take it that status and initial score have been confounded in
this experiment, that is, the low status group on the whole has a lower
initial score, and thus there is a question as to whether some of these
differences may be due to the initial score position rather than status.
(Beginning lower, they had a greater opportunity for increase!)

Second, there is a question I would like to ask for information.
Essentially we have an analysis of variance here. Which error term was
used for the conclusions?

Third, we seem to make statements separately about the low status
and high status groups. Isn't the main interest of this operation the
comparison of these groups, interactions between status and other
variables? If you look at the interactions, do they bear out all the
conditions that have been set down?

Dr. Haggard: As I understand confounding, it occurs when the
effects of two or more variables, or treatments, or conditions, are
thrown together, so that there is no means of identifying the source
of variation attributable to each of them separately. This was not the
case in this experiment. As I pointed out in the mimeographed
Appendix, the subjects in both social-status groups were matched on four
variables, and a number of other conditions were used as control
variables. Each of these variables was used in the data analysis to
partial out their separate effects in order to make more precise
statements about the variable or condition under consideration.
Consequently, even though in this experiment the low-status children
averaged ten points lower on I.Q. and had an average of six months less
schooling, this does not lead to confounding since the effects of each
of the variables were controlled or accounted for.

As for your second question, I do not understand your statement that
the error term was used for the conclusions. It is true that our
significance tests are made up of a ratio, whether it is t, F, or
Chi-square, in which the numerator is knowledge and the denominator is
ignorance (or the error estimate); that is to say, of the total
variability among the data, the numerator is made up of the known or
controlled sources of variation, and the denominator is the remainder,
the unknown or uncontrolled sources of variation. One advantage of the
Johnson-Neyman technique is that the effects of such variables as school
grade, I.Q., etc., are not left undetermined. Hence, the removal of the
effects of the various known or controlled sources of variation from the
denominator, or error estimate, serves to make the conclusions more
precise.
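The effect of moving a known source of variation out of the error estimate can be illustrated with a small numerical sketch. The Python below is not the Johnson-Neyman technique itself; it is a simplified covariance adjustment on invented data, in which two groups are matched pair by pair on I.Q. The function names, the sample size, and every number are assumptions made for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def centered(values):
    m = mean(values)
    return [v - m for v in values]


def f_for_group_effect(y_a, y_b, x_a=None, x_b=None):
    """F ratio for the difference between two group means.

    When covariate lists are given, the pooled within-group regression
    of y on x is removed first and one degree of freedom is charged to
    the error term for the fitted slope. This simplified form matches
    the usual covariance analysis only when the two covariate means are
    equal, as they are for groups matched on the covariate.
    """
    n_a, n_b = len(y_a), len(y_b)
    error_df = n_a + n_b - 2
    if x_a is not None and x_b is not None:
        cx = centered(x_a) + centered(x_b)
        cy = centered(y_a) + centered(y_b)
        slope = sum(x * y for x, y in zip(cx, cy)) / sum(x * x for x in cx)
        grand_x = mean(x_a + x_b)
        y_a = [y - slope * (x - grand_x) for x, y in zip(x_a, y_a)]
        y_b = [y - slope * (x - grand_x) for x, y in zip(x_b, y_b)]
        error_df -= 1
    difference = mean(y_a) - mean(y_b)
    error_ss = sum(d * d for d in centered(y_a) + centered(y_b))
    mean_square_error = error_ss / error_df
    return difference ** 2 / (mean_square_error * (1 / n_a + 1 / n_b))


rng = random.Random(1952)
# Invented I.Q. scores; each matched pair shares one value.
iq = [rng.gauss(100, 15) for _ in range(40)]
# Invented gain scores: a 2-point practice effect, a strong dependence
# on I.Q., and unexplained variation.
practice = [2.0 + 0.5 * (q - 100) + rng.gauss(0, 3) for q in iq]
control = [0.5 * (q - 100) + rng.gauss(0, 3) for q in iq]

f_raw = f_for_group_effect(practice, control)
f_adjusted = f_for_group_effect(practice, control, iq, iq)
print(f_raw, f_adjusted)
```

In this sketch the difference between group means is the same in both analyses; only the error estimate shrinks once the variation tied to I.Q. is taken out, so the adjusted ratio is the larger of the two.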

Now, while I was making notes on your first two questions, you were
asking a third, which I missed. Will you please ask it again?

Dr. Tukey: Apparently the conclusion is that a difference was
significant by test for the high status children and the implication is
that in the low status children it wasn't. What I would like to know is,
is the difference between high and low status children significant?
Because it is quite possible to have, purely by chance, the difference
for one status come out significant and the other not when the true
difference is constant and the same.

Dr. Haggard: One point of my summary was that the condition of
practice facilitated the gain in performance of the high-status children
who took the standard form of the retest, and the gain of the low-status
children who took the revised retest. On the standard retest,
practice did help the high-status but not those from the low-status
group.

Dr. Tukey: Did it have a negative effect, or non-significant effect?

Dr. Haggard: It was a non-significant effect.

Dr. Tukey: Have you any idea of the F-value?

Dr. Haggard: Not at the moment, except to say that the F-value
fell below the .10 level. The F-value for the high-status group was
19.56 with 1 and 120 degrees of freedom.

Dr. Tukey: It is perfectly possible to get by reasonable sampling
an F of 19.6 in one case and a non-significant F in another, where the
population values are just the same, and so it seems to me the real
question hasn't been answered in Point 1 at all. Are we sure there is
a difference between the two groups in this characteristic?

Dr. Haggard: Although I did not mention the comparison you are
asking for in my summary statement, it is in the monograph in press
(17). It reads, "In fact, the high-status groups profited significantly
more from the Practice Sessions than did the low-status children
(F(1,540) = 7.32; P < .01) when the Standard form of the Retest was
given."
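Dr. Tukey's point in this exchange, that one group can show a significant F and the other a non-significant F even when the population effect is identical in both, can be illustrated by simulation. The Python sketch below is an assumption-laden illustration and not a reanalysis of the experiment: the effect size, the group sizes (chosen to give 1 and 120 degrees of freedom), the replication count, and the seed are all invented. The only tabled value used is the 5 per cent point of F for 1 and 120 degrees of freedom, 3.92.

```python
import random

F_CRIT_1_120 = 3.92  # tabled 5 per cent point of F with 1 and 120 df


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def f_ratio(treated, control):
    """F for the difference between two equal-sized groups (t squared)."""
    n = len(treated)
    difference = mean(treated) - mean(control)
    pooled_variance = (sample_variance(treated) + sample_variance(control)) / 2
    return difference ** 2 / (pooled_variance * 2 / n)


def share_of_conflicting_verdicts(reps=2000, n=61, effect=0.36, seed=1952):
    """Share of replications in which exactly one status group reaches
    significance although the true practice effect is the same in both."""
    rng = random.Random(seed)
    conflicting = 0
    for _ in range(reps):
        verdicts = []
        for _status in ("high", "low"):
            treated = [rng.gauss(effect, 1.0) for _ in range(n)]
            control = [rng.gauss(0.0, 1.0) for _ in range(n)]
            verdicts.append(f_ratio(treated, control) > F_CRIT_1_120)
        conflicting += verdicts[0] != verdicts[1]
    return conflicting / reps


share = share_of_conflicting_verdicts()
print(share)
```

Under these assumed settings each separate test has roughly an even chance of reaching significance, so the two verdicts disagree in a large share of replications. This is why the direct comparison of the two groups, the test Dr. Haggard quotes from the monograph, is the one that bears on the question.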